Skip to content

LongParser

Pipeline

ENDEVSOLS/LongParser

Pipeline Reference¶

The DocumentPipeline is the main entry point for LongParser's extraction pipeline.

DocumentPipeline¶

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(config=ProcessingConfig())
doc = pipeline.process("document.pdf")

Constructor¶

DocumentPipeline(config: ProcessingConfig)

Parameter	Type	Description
`config`	`ProcessingConfig`	Extraction and chunking configuration

Methods¶

`process(file_path)`¶

Process a document end-to-end through Extract → Validate → Chunk.

doc = pipeline.process("report.pdf")
# Returns: longparser.schemas.Document

Returns: Document with .pages, .blocks, .chunks populated.

`process_batch(file_paths)`¶

Process multiple documents sequentially.

docs = pipeline.process_batch(["a.pdf", "b.docx", "c.pptx"])

ProcessingConfig¶

from longparser import ProcessingConfig

config = ProcessingConfig(
    do_ocr=True,
    do_table_structure=True,
    formula_mode="smart",   # fast | smart | full
    formula_ocr=True,
    export_images=False,
    force_full_page_ocr=False,
    max_pages=None,
)

Field	Type	Default	Description
`do_ocr`	`bool`	`True`	Enable Tesseract OCR
`do_table_structure`	`bool`	`True`	Enable TableFormer
`formula_mode`	`str`	`"smart"`	Equation parsing mode
`formula_ocr`	`bool`	`True`	Enable LaTeX OCR
`export_images`	`bool`	`False`	Export figure images
`force_full_page_ocr`	`bool`	`False`	OCR entire page
`max_pages`	`int \\| None`	`None`	Page cap