Quickstart
Parse your first document in under 5 minutes.
1. Install
2. Parse a PDF
from longparser import DocumentPipeline, ProcessingConfig
# Create pipeline with defaults
pipeline = DocumentPipeline(ProcessingConfig())
# Parse a PDF
doc = pipeline.process("research_paper.pdf")
print(f"Pages: {len(doc.pages)}")
print(f"Blocks: {len(doc.blocks)}")
print(f"Chunks: {len(doc.chunks)}")
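When several files need parsing, the same three counts can be collected in a loop. The sketch below assumes only what the snippet above shows: a `process(path)` method on the pipeline and `pages`, `blocks`, and `chunks` attributes on the result. The `summarize` helper and the `FakePipeline` stand-in are hypothetical names, used so the example runs without LongParser installed; in practice a real `DocumentPipeline` would be passed in instead.

```python
from types import SimpleNamespace


def summarize(pipeline, paths):
    """Parse each path and return {path: {"pages": n, "blocks": n, "chunks": n}}.

    Hypothetical helper: relies only on pipeline.process() and the
    pages/blocks/chunks attributes shown in the quickstart.
    """
    summary = {}
    for path in paths:
        doc = pipeline.process(path)
        summary[path] = {
            "pages": len(doc.pages),
            "blocks": len(doc.blocks),
            "chunks": len(doc.chunks),
        }
    return summary


class FakePipeline:
    """Stand-in for DocumentPipeline so this sketch runs on its own."""

    def process(self, path):
        # Pretend every document has 2 pages, 5 blocks, and 3 chunks.
        return SimpleNamespace(pages=[0] * 2, blocks=[0] * 5, chunks=[0] * 3)


if __name__ == "__main__":
    for path, counts in summarize(FakePipeline(), ["a.pdf", "b.pdf"]).items():
        print(path, counts)
```

Passing the pipeline in as an argument keeps the helper easy to test and means a single configured pipeline is reused across files.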
3. Inspect Chunks
for chunk in doc.chunks[:3]:
    print(f"[{chunk.chunk_type}] tokens={chunk.token_count}")
    print(chunk.text[:200])
    print("---")
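Beyond printing the first few chunks, it can help to see how tokens are distributed across chunk types. The following is a minimal sketch that uses only the `chunk_type` and `token_count` attributes from the loop above. The `Chunk` dataclass is a stand-in and the `"text"` / `"table"` values are placeholder type names, not confirmed LongParser values; with a parsed document, `doc.chunks` would be passed instead of `sample`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in with the three attributes used in the loop above."""
    chunk_type: str
    token_count: int
    text: str


def tokens_by_type(chunks):
    """Return {chunk_type: (number_of_chunks, total_tokens)}."""
    counts = defaultdict(lambda: [0, 0])
    for chunk in chunks:
        counts[chunk.chunk_type][0] += 1
        counts[chunk.chunk_type][1] += chunk.token_count
    return {kind: tuple(pair) for kind, pair in counts.items()}


# Placeholder data; the type names here are illustrative only.
sample = [
    Chunk("text", 120, "Introduction ..."),
    Chunk("table", 80, "| a | b |"),
    Chunk("text", 60, "Conclusion ..."),
]
print(tokens_by_type(sample))
```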
4. Use with LangChain
from longparser.integrations.langchain import LongParserLoader
loader = LongParserLoader("report.pdf")
documents = loader.load() # Returns List[Document]
5. Use with LlamaIndex
from longparser.integrations.llamaindex import LongParserReader
reader = LongParserReader()
nodes = reader.load_data(file="report.pdf")
6. Start the REST Server
# Set environment variables
cp .env.example .env
# Edit .env with your keys
# Run server
uvicorn longparser.server.app:app --reload
Then visit http://localhost:8000/docs for the Swagger UI.
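To confirm from a script that the server is up, one option is to request the `/docs` page and check the status code. This is a rough sketch using only the Python standard library; `docs_reachable` is a hypothetical helper name, and the default base URL is simply the one mentioned above.

```python
import urllib.error
import urllib.request


def docs_reachable(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET {base_url}/docs answers with HTTP 200."""
    url = base_url.rstrip("/") + "/docs"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("docs reachable:", docs_reachable())
```

A `False` result only says the page did not answer within the timeout; it does not distinguish a server that is still starting from one that failed to launch.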
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | OCR + table structure |
| Word | .docx | OMML equation injection |
| PowerPoint | .pptx | Slide-by-slide chunking |
| Excel | .xlsx | Sheet-aware table parsing |
| CSV | .csv | Column-profile chunks |
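A script that accepts arbitrary files may want to check the extension against this table before handing the path to the pipeline. The sketch below is one way to do that; `is_supported` and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating extensions as case-insensitive is an assumption, not documented LongParser behavior.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv"}


def is_supported(path):
    """Return True if the file extension appears in the table.

    Assumption: matching is case-insensitive ("report.PDF" counts as PDF).
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["report.PDF", "slides.pptx", "notes.txt"]:
    print(name, is_supported(name))
```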