Pipeline Reference

`DocumentPipeline` is the main entry point to LongParser's extraction pipeline.

DocumentPipeline

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")

Constructor

DocumentPipeline(config: ProcessingConfig | None = None)

| Parameter | Type | Description |
|---|---|---|
| `config` | `ProcessingConfig \| None` | Extraction and chunking configuration (uses defaults if `None`) |

Methods

`process_file(file_path)`

Process a document end-to-end through Extract → Validate → Chunk.

result = pipeline.process_file("report.pdf")
# Returns: longparser.pipeline.PipelineResult

Returns: PipelineResult with .document and .chunks populated.
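
A small wrapper can make the returned `PipelineResult` easier to log. The sketch below is illustrative only: `summarize_file` is a hypothetical helper, not part of LongParser, and it assumes `.chunks` is a sized collection that supports `len()`, which this page does not state. The pipeline is passed in as an argument, so the function works with any object exposing `process_file`.

```python
def summarize_file(pipeline, file_path):
    """Process one file and return a small, loggable summary.

    `pipeline` is expected to be a DocumentPipeline (or any object with a
    compatible process_file method). Assumption: result.chunks supports len().
    """
    result = pipeline.process_file(file_path)
    return {"file": file_path, "chunks": len(result.chunks)}
```

With a configured pipeline, `summarize_file(pipeline, "report.pdf")` would return a dictionary holding the file name and the number of chunks produced.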

`process(request)`

Process a document from a JobRequest object.

from longparser import JobRequest
request = JobRequest(file_path="report.pdf")
result = pipeline.process(request)

`process_batch(file_paths)`

Process multiple documents sequentially.

results = pipeline.process_batch(["a.pdf", "b.docx", "c.pptx"])
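
This page does not describe what `process_batch` does when one file in the list fails. If per-file error isolation matters, one option is to loop over `process_file` yourself. The sketch below assumes only the `process_file` method documented above; `process_each` is a hypothetical helper, not part of LongParser.

```python
def process_each(pipeline, file_paths):
    """Process files one at a time, collecting failures instead of aborting.

    Returns (results, errors): two dicts keyed by file path. A path appears
    in exactly one of them.
    """
    results, errors = {}, {}
    for path in file_paths:
        try:
            results[path] = pipeline.process_file(path)
        except Exception as exc:  # deliberately broad: record and continue
            errors[path] = exc
    return results, errors
```

Calling `process_each(pipeline, ["a.pdf", "b.docx", "c.pptx"])` processes the files sequentially, as `process_batch` does, but keeps going after a failure and reports which paths raised.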

ProcessingConfig

from longparser import ProcessingConfig

config = ProcessingConfig(
    do_ocr=True,
    do_table_structure=True,
    formula_mode="smart",   # fast | smart | full
    formula_ocr=True,
    export_images=False,
    force_full_page_ocr=False,
    max_pages=None,
    redact_pii=False,       # Enable PII redaction
    use_ner_redaction=False, # Enable spaCy NER second pass
    ner_model="en_core_web_sm",
)

| Field | Type | Default | Description |
|---|---|---|---|
| `do_ocr` | `bool` | `True` | Enable Tesseract OCR |
| `do_table_structure` | `bool` | `True` | Enable TableFormer table-structure recognition |
| `formula_mode` | `str` | `"smart"` | Equation parsing mode: `"fast"`, `"smart"`, or `"full"` |
| `formula_ocr` | `bool` | `True` | Enable LaTeX OCR for formulas |
| `export_images` | `bool` | `False` | Export figure images |
| `force_full_page_ocr` | `bool` | `False` | Run OCR over the entire page |
| `max_pages` | `int \| None` | `None` | Maximum number of pages to process (`None` for no limit) |
| `redact_pii` | `bool` | `False` | Enable PII redaction (regex + Luhn) |
| `use_ner_redaction` | `bool` | `False` | Enable a spaCy NER second pass for contextual PII |
| `ner_model` | `str` | `"en_core_web_sm"` | spaCy model used for NER |
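
To make the "regex + Luhn" wording for `redact_pii` concrete, here is a minimal sketch of the general technique: a regular expression finds card-number candidates, and the Luhn checksum filters out digit runs that are not plausible card numbers. This is not LongParser's implementation. The pattern, the `[REDACTED]` mask, and the function names are all assumptions chosen for illustration.

```python
import re

# Candidate pattern (an assumption): 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str, mask: str = "[REDACTED]") -> str:
    """Replace Luhn-valid card-number candidates in `text` with `mask`."""
    return CARD_CANDIDATE.sub(
        lambda m: mask if luhn_valid(m.group()) else m.group(), text
    )
```

The checksum step is what keeps false positives down: a 16-digit order number that fails the Luhn check is left untouched, while a well-formed test card number such as `4111 1111 1111 1111` is masked.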