Pipeline Reference
The DocumentPipeline is the main entry point for LongParser's extraction pipeline.
DocumentPipeline
from longparser import DocumentPipeline, ProcessingConfig
pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")
Constructor
| Parameter | Type | Description |
|---|---|---|
| `config` | `ProcessingConfig \| None` | Extraction and chunking configuration (uses defaults if `None`) |
Methods
process_file(file_path)
Process a document end-to-end through Extract → Validate → Chunk.
Returns: PipelineResult with .document and .chunks populated.
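A minimal sketch of consuming the result, assuming only what is stated above: the returned object exposes `.document` and `.chunks`. The helper `summarize_result` is hypothetical and not part of LongParser, and the sketch assumes `.chunks` is a sequence. A stand-in object is used so the example runs without the library installed.

```python
from types import SimpleNamespace


def summarize_result(result):
    """Return a small summary of a PipelineResult-like object.

    Assumes only the two attributes named in this reference:
    `.document` and `.chunks` (taken here to be a sequence or None).
    """
    chunks = list(result.chunks or [])
    return {
        "has_document": result.document is not None,
        "chunk_count": len(chunks),
    }


# Stand-in for a real PipelineResult (hypothetical data, for illustration only).
fake_result = SimpleNamespace(document={"title": "demo"}, chunks=["a", "b", "c"])
summary = summarize_result(fake_result)
print(summary)
```

With the real library, the same helper would presumably be called on the value returned by `pipeline.process_file(...)`.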
process(request)
Process a document from a JobRequest object.
from longparser import JobRequest
request = JobRequest(file_path="report.pdf")
result = pipeline.process(request)
process_batch(file_paths)
Process multiple documents sequentially.
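Sequential processing can be pictured as a plain loop over `process_file`. The sketch below is not LongParser's implementation of `process_batch`. `process_sequentially` is a hypothetical helper, and collecting failures per file instead of stopping at the first one is a design choice made here for illustration. A stub pipeline stands in for `DocumentPipeline` so the example runs on its own.

```python
def process_sequentially(pipeline, file_paths):
    """Call pipeline.process_file on each path, one at a time, in order.

    Returns (results, errors): results maps path -> result,
    errors maps path -> the exception raised for that path.
    """
    results, errors = {}, {}
    for path in file_paths:
        try:
            results[path] = pipeline.process_file(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[path] = exc
    return results, errors


class StubPipeline:
    """Hypothetical stand-in exposing only process_file."""

    def process_file(self, file_path):
        if not file_path.endswith(".pdf"):
            raise ValueError(f"unsupported file: {file_path}")
        return f"parsed:{file_path}"


results, errors = process_sequentially(StubPipeline(), ["a.pdf", "notes.xyz", "b.pdf"])
print(sorted(results), sorted(errors))
```

Whether the real `process_batch` raises on the first failure or continues is not stated in this reference, so check its behavior before relying on either.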
ProcessingConfig
from longparser import ProcessingConfig
config = ProcessingConfig(
    do_ocr=True,
    do_table_structure=True,
    formula_mode="smart",        # fast | smart | full
    formula_ocr=True,
    export_images=False,
    force_full_page_ocr=False,
    max_pages=None,
    redact_pii=False,            # Enable PII redaction
    use_ner_redaction=False,     # Enable spaCy NER second pass
    ner_model="en_core_web_sm",
)
| Field | Type | Default | Description |
|---|---|---|---|
| `do_ocr` | `bool` | `True` | Enable Tesseract OCR |
| `do_table_structure` | `bool` | `True` | Enable TableFormer |
| `formula_mode` | `str` | `"smart"` | Equation parsing mode: `"fast"`, `"smart"`, or `"full"` |
| `formula_ocr` | `bool` | `True` | Enable LaTeX OCR |
| `export_images` | `bool` | `False` | Export figure images |
| `force_full_page_ocr` | `bool` | `False` | OCR the entire page |
| `max_pages` | `int \| None` | `None` | Page cap (`None` means no cap) |
| `redact_pii` | `bool` | `False` | Enable PII redaction (regex + Luhn) |
| `use_ner_redaction` | `bool` | `False` | Enable a spaCy NER second pass for contextual PII |
| `ner_model` | `str` | `"en_core_web_sm"` | spaCy model used for NER |
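To illustrate what "regex + Luhn" redaction generally means, here is a minimal sketch of the technique for payment card numbers: a regex finds candidate digit runs, and the Luhn checksum filters out runs that are not plausible card numbers. This is not LongParser's implementation. The pattern, the placeholder text, and the names `CARD_PATTERN`, `luhn_valid`, and `redact_cards` are all assumptions made for the example.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def redact_cards(text: str, placeholder: str = "[REDACTED_CARD]") -> str:
    """Replace regex matches with a placeholder only when they pass Luhn."""
    def replace(match):
        candidate = match.group()
        return placeholder if luhn_valid(candidate) else candidate
    return CARD_PATTERN.sub(replace, text)


sample = "Card 4111 1111 1111 1111 on file; order id 1234 5678 9012 3456."
redacted = redact_cards(sample)
print(redacted)
# → Card [REDACTED_CARD] on file; order id 1234 5678 9012 3456.
```

The checksum step is what keeps the order id intact: it matches the regex but fails Luhn, so it is left alone. A regex-only pass would redact both.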