Home
Privacy-first document intelligence engine for production RAG pipelines.
Parse PDF, DOCX, PPTX, XLSX, and CSV files into validated, AI-ready chunks with human-in-the-loop (HITL) review.
Why LongParser?
Many RAG pipelines fail at the data layer. Hallucinations, missed tables, garbled equations, and unverified citations often stem from poor document parsing rather than from the LLM itself.
LongParser solves the input problem.
| Feature | LongParser |
|---|---|
| Multi-format extraction | PDF, DOCX, PPTX, XLSX, CSV |
| Hybrid chunking (6 strategies) | ✅ |
| HITL review workflow | ✅ |
| 3-layer memory chat | ✅ |
| Built-in citation validation | ✅ |
| LaTeX/equation parsing | ✅ |
| LangChain & LlamaIndex ready | ✅ |
| RTL language support | ✅ |
| Docker-ready server | ✅ |
Quick Start
from longparser import DocumentPipeline, ProcessingConfig
pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("report.pdf")
print(f"Extracted {len(doc.blocks)} blocks, {len(doc.chunks)} chunks")
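The same calls extend naturally to several files. The sketch below is a minimal, hypothetical helper: `summarize_documents` and its `process` parameter are illustrative names, not part of LongParser. It assumes only what the Quick Start shows, namely that a processed document exposes `blocks` and `chunks` sequences.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Tuple


def summarize_documents(
    paths: Iterable[str],
    process: Callable[[str], Any],
) -> Dict[str, Tuple[int, int]]:
    """Map each file name to (block count, chunk count).

    `process` is any callable that takes a path and returns an object
    with `blocks` and `chunks` attributes (assumed: `pipeline.process`).
    """
    summary: Dict[str, Tuple[int, int]] = {}
    for path in paths:
        doc = process(str(path))
        # Key by bare file name so the report stays readable.
        summary[Path(path).name] = (len(doc.blocks), len(doc.chunks))
    return summary
```

With the pipeline from the Quick Start, `summarize_documents(["report.pdf", "slides.pptx"], pipeline.process)` would return a dictionary keyed by file name. Passing the processing function as an argument keeps the helper independent of any particular configuration.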
Architecture
graph LR
A[Document] --> B[Extract]
B --> C[Validate]
C --> D[HITL Review]
D --> E[Chunk]
E --> F[Embed]
F --> G[Index]
G --> H[Chat Engine]
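Read left to right, the diagram is a linear chain in which each stage consumes the previous stage's output. As a purely illustrative sketch (this is not LongParser's internal code; `run_stages` and the toy stage functions are hypothetical), such a chain can be modeled as an ordered list of functions:

```python
from typing import Any, Callable, List, Sequence, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_stages(document: Any, stages: Sequence[Stage]) -> Tuple[Any, List[str]]:
    """Pass `document` through each stage in order.

    Returns the final output plus the names of the stages that ran,
    so the execution order can be inspected.
    """
    trace: List[str] = []
    result = document
    for name, stage in stages:
        result = stage(result)
        trace.append(name)
    return result, trace


# Toy stand-ins for three of the stages in the diagram (hypothetical logic).
toy_stages: List[Stage] = [
    # Split raw text into blocks on blank lines.
    ("Extract", lambda text: text.split("\n\n")),
    # Drop blocks that are empty or whitespace-only.
    ("Validate", lambda blocks: [b for b in blocks if b.strip()]),
    # Cap each block at 40 characters as a crude size limit.
    ("Chunk", lambda blocks: [b[:40] for b in blocks]),
]

chunks, order = run_stages("Intro paragraph.\n\n\n\nSecond paragraph.", toy_stages)
```

Because validation runs before chunking, the empty block produced by the repeated blank lines never reaches the later stages. A review step like the one in the diagram would sit between those two stages as another entry in the list.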
Installation
# Recommended — everything included (GPU/CPU both work)
pip install "longparser[gpu]"
# Core SDK only — minimal, no server
pip install longparser
→ Full installation guide — CPU-only, Docker, extras reference
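After installing, it can help to confirm which version ended up in the active environment. This is a minimal sketch using only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the distribution name `longparser` is taken from the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("longparser")
print(f"longparser {found}" if found else "longparser is not installed")
```

Running this inside the virtual environment used for the install avoids picking up a copy from a different interpreter.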
Next Steps
- Installation Guide — detailed setup with virtual environments
- Quickstart — parse your first document in 5 minutes
- Configuration — environment variables and tuning