
Document Parsing

LongParser uses Docling with Tesseract CLI OCR as its extraction engine and supports PDF, DOCX, PPTX, XLSX, and CSV input.

Supported Formats

Format    Capabilities
PDF       Layout analysis, OCR, table structure, equation detection
DOCX      OMML equations → LaTeX injection
PPTX      Slide-by-slide extraction with hierarchy
XLSX      Sheet-aware table chunking with column profiles
CSV       Column-type inference, schema chunks
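
Since only these five formats are listed, it can be useful to filter a batch of inputs before handing them to the pipeline. The sketch below is illustrative only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical helper names, not part of LongParser, and the extension set is simply read off the table above.

```python
from pathlib import Path

# Hypothetical helper, not part of LongParser.
# The extensions are taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv"}


def is_supported(path: str) -> bool:
    """Return True if the file extension matches a listed format (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


candidates = ["paper.pdf", "slides.PPTX", "notes.txt", "data.csv"]
to_process = [p for p in candidates if is_supported(p)]
print(to_process)
```

Checking the extension is only a cheap pre-filter; it says nothing about whether the file contents are valid.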

Basic Usage

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("paper.pdf")

Formula Modes

LongParser has three modes for equation handling, selected with the formula_mode option:

config = ProcessingConfig(formula_mode="smart")
# fast   — Unicode normalization only (fastest)
# smart  — BBox crop → pix2tex OCR for detected formulas
# full   — Docling enrichment enabled for all formulas
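
To give a feel for what "Unicode normalization only" can mean in practice, here is a minimal sketch using Python's standard `unicodedata` module. Which normalization form LongParser applies in fast mode is not stated here, so NFKC is an assumption, and `normalize_formula_text` is a hypothetical name used for illustration.

```python
import unicodedata


def normalize_formula_text(text: str) -> str:
    """Fold compatibility characters (ligatures, superscripts, full-width forms) with NFKC.

    Assumption: NFKC is used for illustration; the library may use a different form.
    """
    return unicodedata.normalize("NFKC", text)


print(normalize_formula_text("E = mc²"))  # the superscript two becomes a plain "2"
print(normalize_formula_text("ﬁeld"))     # the "fi" ligature becomes two letters
```

Note that NFKC folds `x²` to `x2`, so exponent structure is lost. Normalization cleans up characters but does not recover the layout of a formula.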

Accessing Results

# Pages
for page in doc.pages:
    print(f"Page {page.page_number}: {page.width}x{page.height}")

# Blocks (semantic units)
for block in doc.blocks:
    print(f"[{block.type}] p={block.provenance.page_number}: {block.text[:80]}")

# Chunks (RAG-ready)
for chunk in doc.chunks:
    print(f"{chunk.chunk_type} | {chunk.token_count} tokens | pages={chunk.page_numbers}")
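
The attributes shown above are enough to build simple indexes over the results. The following sketch groups chunks by page and sums their token counts. It assumes `page_numbers` is an iterable of integers and `token_count` is an integer. The `SimpleNamespace` objects and the chunk type values are stand-ins so the example runs without LongParser installed, and `chunks_by_page` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def chunks_by_page(chunks):
    """Map each page number to the chunks that touch it."""
    index = defaultdict(list)
    for chunk in chunks:
        for page in chunk.page_numbers:
            index[page].append(chunk)
    return dict(index)


# Stand-in objects with made-up values; in real use, pass doc.chunks instead.
demo_chunks = [
    SimpleNamespace(chunk_type="text", token_count=120, page_numbers=[1]),
    SimpleNamespace(chunk_type="table", token_count=80, page_numbers=[1, 2]),
]

index = chunks_by_page(demo_chunks)
total_tokens = sum(c.token_count for c in demo_chunks)
print(sorted(index), total_tokens)
```

A chunk that spans several pages appears under each of them, which is usually what a page-level lookup needs.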

Block Types

Type         Description
heading      Section header (with level)
paragraph    Body text
table        Structured table data
list_item    Bullet or numbered list item
equation     Mathematical formula
figure       Image or diagram
caption      Figure/table caption
header       Page header
footer       Page footer
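
A common use of block types is to drop page headers and footers before building body text. The sketch below assumes `block.type` compares equal to the plain string names in the table above. If the library exposes an enum instead, the comparison would need adjusting. `body_blocks` and the demo objects are hypothetical and only there to make the example runnable.

```python
from types import SimpleNamespace

# Page furniture that usually adds noise to body text; names come from the Block Types table.
FURNITURE_TYPES = {"header", "footer"}


def body_blocks(blocks):
    """Keep blocks whose type is not a page header or footer."""
    return [b for b in blocks if b.type not in FURNITURE_TYPES]


# Stand-in blocks with made-up text; in real use, pass doc.blocks instead.
demo_blocks = [
    SimpleNamespace(type="header", text="Journal of Examples"),
    SimpleNamespace(type="heading", text="1 Introduction"),
    SimpleNamespace(type="paragraph", text="Body text."),
    SimpleNamespace(type="footer", text="Page 1"),
]

kept = [b.text for b in body_blocks(demo_blocks)]
print(kept)
```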

RTL Language Support

LongParser automatically detects right-to-left scripts (Arabic, Hebrew, etc.) and applies correct text ordering:

from longparser.utils.rtl_detector import detect_rtl

is_rtl = detect_rtl("مرحبا بالعالم")  # True
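
For intuition, right-to-left detection can be approximated with the Unicode bidirectional classes exposed by Python's standard `unicodedata` module. This is a minimal sketch of one possible heuristic, not the implementation behind `detect_rtl`; `looks_rtl` is a hypothetical name.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left: Hebrew-style letters and Arabic letters
LTR_CLASSES = {"L"}        # strong left-to-right letters


def looks_rtl(text: str) -> bool:
    """Return True when strong RTL characters outnumber strong LTR ones."""
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in RTL_CLASSES)
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) in LTR_CLASSES)
    return rtl > ltr


print(looks_rtl("مرحبا بالعالم"))
print(looks_rtl("שלום עולם"))
print(looks_rtl("Hello, world"))
```

Counting only strong directional characters means spaces, digits, and punctuation do not sway the result, which matters for mixed-script text.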