LongParser

Privacy-first document intelligence engine for production RAG pipelines.

Parse PDF, DOCX, PPTX, XLSX, and CSV files into validated, AI-ready chunks with human-in-the-loop (HITL) review.


Why LongParser?

Most RAG pipelines fail at the data layer. Hallucinations, missed tables, garbled equations, and unverified citations stem from poor document parsing — not from the LLM itself.

LongParser solves the input problem.

Features

Multi-format extraction: PDF, DOCX, PPTX, XLSX, CSV
Hybrid chunking: 6 strategies
HITL review workflow
3-layer memory chat
Built-in citation validation
LaTeX/equation parsing
LangChain & LlamaIndex ready
RTL language support
Docker-ready server

Quick Start

# Install the core SDK (shell)
pip install longparser

# Parse a document (Python)
from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("report.pdf")

print(f"Extracted {len(doc.blocks)} blocks, {len(doc.chunks)} chunks")
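
The quick start handles a single file. For a folder of documents, a small wrapper along the following lines may help. This is a sketch, not part of LongParser: `process_many` and `summarize` are hypothetical helpers, and it assumes that `pipeline.process` raises an exception when a file cannot be parsed, which the quick start does not state. It relies only on the `process` method and the `blocks` and `chunks` attributes shown above.

```python
from typing import Any, Dict, Iterable, Tuple


def summarize(doc: Any) -> str:
    """Build the same summary line as the quick start for one parsed document."""
    return f"Extracted {len(doc.blocks)} blocks, {len(doc.chunks)} chunks"


def process_many(
    pipeline: Any, paths: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run pipeline.process on each path, keeping successes and failures apart.

    Hypothetical helper: assumes parse failures surface as exceptions.
    """
    docs: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            docs[path] = pipeline.process(path)
        except Exception as exc:  # one bad file should not stop the batch
            errors[path] = exc
    return docs, errors


# Usage with the quick-start pipeline (requires longparser to be installed):
# pipeline = DocumentPipeline(ProcessingConfig())
# docs, errors = process_many(pipeline, ["report.pdf", "slides.pptx"])
# for path, doc in docs.items():
#     print(path, summarize(doc))
```

Collecting errors per path instead of failing fast keeps a long batch running and leaves a record of which files need a second look.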

Architecture

graph LR
    A[Document] --> B[Extract]
    B --> C[Validate]
    C --> D[HITL Review]
    D --> E[Chunk]
    E --> F[Embed]
    F --> G[Index]
    G --> H[Chat Engine]
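
The diagram describes a linear flow in which each stage consumes the previous stage's output. As a rough illustration of that shape only, the sketch below chains placeholder stages in the same order. None of these names are LongParser APIs: `make_stage`, `STAGES`, and `run` are hypothetical, and each stage simply records that it ran. The chat engine is left out because it consumes the finished index and does not transform the document.

```python
from functools import reduce
from typing import Callable, Dict, List

# A stage takes the pipeline state and returns a new state.
Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Create a placeholder stage that only appends its name to the trace."""
    def stage(state: Dict) -> Dict:
        # A real stage would transform the document here.
        return {**state, "trace": state["trace"] + [name]}
    return stage


# Stage order mirrors the diagram: Extract -> Validate -> HITL Review
# -> Chunk -> Embed -> Index.
STAGES: List[Stage] = [
    make_stage(name)
    for name in ["extract", "validate", "hitl_review", "chunk", "embed", "index"]
]


def run(source: str) -> Dict:
    """Feed the source through every stage in order."""
    initial = {"source": source, "trace": []}
    return reduce(lambda state, stage: stage(state), STAGES, initial)


result = run("report.pdf")
print(" -> ".join(result["trace"]))
```

Placing validation and review before chunking means errors are caught on whole blocks, before they are split up and embedded.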

Installation

# Recommended — everything included (GPU/CPU both work)
pip install "longparser[gpu]"

# Core SDK only — minimal, no server
pip install longparser
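
After installing, one way to confirm which version ended up in the active environment is to query the package metadata with the standard library. This is a generic check, not a LongParser feature; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("longparser") or "longparser is not installed")
```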

Full installation guide — CPU-only, Docker, extras reference

Next Steps