Extractors Reference¶
BaseExtractor¶
Abstract base class for all document extractors.
All extractors implement:
DoclingExtractor¶
Production extractor using Docling with Tesseract CLI OCR.
from longparser.extractors.docling_extractor import DoclingExtractor
extractor = DoclingExtractor(
tesseract_lang=["eng"],
tessdata_path=None, # uses system default
force_full_page_ocr=False,
)
doc = extractor.extract("report.pdf", ProcessingConfig())
Constructor Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
tesseract_lang |
list[str] |
["eng"] |
OCR language codes |
tessdata_path |
str \| None |
None |
Path to tessdata directory |
force_full_page_ocr |
bool |
False |
OCR all pages regardless of embedded text |
Formula Modes¶
| Mode | Speed | Quality | Description |
|---|---|---|---|
fast |
⚡⚡⚡ | ★★☆ | Unicode normalization only |
smart |
⚡⚡☆ | ★★★ | BBox crop → pix2tex for FORMULA blocks |
full |
⚡☆☆ | ★★★ | Docling enrichment enabled for all pages |
Supported Formats¶
| Extension | Notes |
|---|---|
.pdf |
Layout analysis, OCR, table structure, optional formula OCR |
.docx |
OMML → LaTeX injection via python-docx |
.pptx |
Slide-by-slide, python-pptx indent levels |
.xlsx |
Sheet-aware, ExcelFormat option |
.csv |
Column-type inference, CsvFormat option |
LaTeXOCR¶
Optional equation OCR backend.
from longparser.extractors.latex_ocr import LaTeXOCR
ocr = LaTeXOCR(backend="pix2tex")
latex = ocr.recognize(pil_image) # Returns LaTeX string
Note
Requires pip install "longparser[latex-ocr]" (pix2tex).