Home

Privacy-first document intelligence engine for production RAG pipelines.

Parse PDF, DOCX, PPTX, XLSX, and CSV files into validated, AI-ready chunks with human-in-the-loop (HITL) review.

Why LongParser?

Many RAG pipelines fail at the data layer. Hallucinations, missed tables, garbled equations, and unverified citations often stem from poor document parsing rather than from the LLM itself.

LongParser solves the input problem.

Feature	LongParser
Multi-format extraction	PDF, DOCX, PPTX, XLSX, CSV
Hybrid chunking (6 strategies)	✅
HITL review workflow	✅
3-layer memory chat	✅
Built-in citation validation	✅
LaTeX/equation parsing	✅
LangChain & LlamaIndex ready	✅
RTL language support	✅
Docker-ready server	✅

Quick Start

pip install longparser

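from longparser import DocumentPipeline, ProcessingConfig

# Build a pipeline with the default processing configuration.
pipeline = DocumentPipeline(ProcessingConfig())

# Parse the document; the result exposes the extracted blocks and chunks.
doc = pipeline.process("report.pdf")

print(f"Extracted {len(doc.blocks)} blocks, {len(doc.chunks)} chunks")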

Architecture

graph LR
    A[Document] --> B[Extract]
    B --> C[Validate]
    C --> D[HITL Review]
    D --> E[Chunk]
    E --> F[Embed]
    F --> G[Index]
    G --> H[Chat Engine]
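
The stage order in the diagram can be read as a simple sequential pipeline. The sketch below is only an illustration of that ordering and is not LongParser's implementation: `Job`, `make_stage`, and `run_pipeline` are hypothetical names, and each stage is a placeholder that just records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage names copied from the architecture diagram, in order.
STAGES = ["Extract", "Validate", "HITL Review", "Chunk", "Embed", "Index", "Chat Engine"]


@dataclass
class Job:
    """Hypothetical record of one document moving through the stages."""
    source: str
    history: List[str] = field(default_factory=list)


def make_stage(name: str) -> Callable[[Job], Job]:
    """Return a placeholder stage that only records that it ran."""
    def stage(job: Job) -> Job:
        job.history.append(name)
        return job
    return stage


def run_pipeline(source: str) -> Job:
    """Pass one document through every stage in diagram order."""
    job = Job(source)
    for stage in map(make_stage, STAGES):
        job = stage(job)
    return job


job = run_pipeline("report.pdf")
print(" -> ".join(job.history))
```

The point to take from the ordering is that validation and HITL review sit before chunking, so anything embedded and indexed downstream has already passed both.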

Installation

# Recommended — everything included (GPU/CPU both work)
pip install "longparser[gpu]"

# Core SDK only — minimal, no server
pip install longparser
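
After either install command, it can be useful to confirm that the package is visible in the active environment. A minimal check using only the Python standard library might look like the following; `installed_version` is a hypothetical helper, and the only assumption about LongParser is that the distribution name is `longparser`, as in the pip commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("longparser")
if version is None:
    print("longparser is not installed in this environment")
else:
    print(f"longparser {version} is installed")
```

Running this inside the same virtual environment used for `pip install` avoids confusing a system-wide install with a project-local one.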

→ Full installation guide — CPU-only, Docker, extras reference

Next Steps

Installation Guide — detailed setup with virtual environments
Quickstart — parse your first document in 5 minutes
Configuration — environment variables and tuning