Quickstart

Parse your first document in under 5 minutes.

1. Install

pip install longparser
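
To confirm the install worked before moving on, a quick check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name `is_installed` is illustrative and not part of LongParser.

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return find_spec(package) is not None


if __name__ == "__main__":
    # "longparser" is the import name used throughout this quickstart.
    status = "installed" if is_installed("longparser") else "missing"
    print(f"longparser: {status}")
```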

2. Parse a PDF

from longparser import DocumentPipeline, ProcessingConfig

# Create pipeline with defaults
pipeline = DocumentPipeline(ProcessingConfig())

# Parse a PDF
doc = pipeline.process("research_paper.pdf")

print(f"Pages: {len(doc.pages)}")
print(f"Blocks: {len(doc.blocks)}")
print(f"Chunks: {len(doc.chunks)}")
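
The same call can be applied to a whole folder of files. The sketch below assumes only that `pipeline.process` takes a file path, as shown above; `process_folder` and the folder name in the usage note are hypothetical. Whether LongParser ships its own batch helper is not covered here.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def process_folder(
    process: Callable[[str], Any], folder: str, pattern: str = "*.pdf"
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run `process` on every matching file in `folder`.

    Returns (results, errors), both keyed by file name.
    """
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for path in sorted(Path(folder).glob(pattern)):
        try:
            results[path.name] = process(str(path))
        except Exception as exc:
            # Record the failure and continue so one bad file
            # does not stop the rest of the batch.
            errors[path.name] = exc
    return results, errors
```

With the pipeline from this step, a call would look like `results, errors = process_folder(pipeline.process, "papers/")`.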

3. Inspect Chunks

for chunk in doc.chunks[:3]:
    print(f"[{chunk.chunk_type}] tokens={chunk.token_count}")
    print(chunk.text[:200])
    print("---")
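
For a larger document, an aggregate view is often easier to read than individual chunks. The sketch below relies only on the `chunk_type` and `token_count` attributes used above; `summarize_chunks` is a hypothetical helper, and the sample values are made up so the code runs without a parsed document.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_chunks(chunks):
    """Return {chunk_type: (count, total_tokens)} for an iterable of chunks."""
    summary = defaultdict(lambda: [0, 0])
    for chunk in chunks:
        entry = summary[chunk.chunk_type]
        entry[0] += 1
        entry[1] += chunk.token_count
    return {kind: tuple(values) for kind, values in summary.items()}


# Stand-in objects; the chunk_type values here are illustrative only.
sample = [
    SimpleNamespace(chunk_type="text", token_count=120),
    SimpleNamespace(chunk_type="table", token_count=80),
    SimpleNamespace(chunk_type="text", token_count=60),
]
print(summarize_chunks(sample))  # prints {'text': (2, 180), 'table': (1, 80)}
```

On a real document, pass `doc.chunks` instead of `sample`.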

4. Use with LangChain

from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
documents = loader.load()  # Returns List[Document]
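
LangChain `Document` objects expose `page_content` and `metadata`, so the loaded list can be inspected before it is passed to a splitter or vector store. The following is a sketch under that assumption; `preview_documents` is a hypothetical helper, and the metadata key in the stand-in is invented, since the keys LongParserLoader sets are not listed here.

```python
from types import SimpleNamespace


def preview_documents(documents, limit=3, width=80):
    """Return one summary line per document: size, metadata keys, text preview."""
    lines = []
    for doc in documents[:limit]:
        text = doc.page_content
        keys = ", ".join(sorted(doc.metadata)) or "none"
        lines.append(f"{len(text)} chars | metadata: {keys} | {text[:width]!r}")
    return lines


# Stand-in with the same two attributes as a LangChain Document.
fake_docs = [SimpleNamespace(page_content="Hello world", metadata={"source": "report.pdf"})]
for line in preview_documents(fake_docs):
    print(line)
```

With real data, call `preview_documents(documents)` on the list returned by `loader.load()`.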

5. Use with LlamaIndex

from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
nodes = reader.load_data(file="report.pdf")

6. Start the REST Server

# Set environment variables
cp .env.example .env
# Edit .env with your keys

# Run server
uvicorn longparser.server.app:app --reload

Then visit http://localhost:8000/docs for the Swagger UI.
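
To check from a script that the server is reachable, something like the sketch below may be enough. It assumes the default host and port shown above and uses only the standard library; `docs_url` and `server_is_up` are hypothetical helpers, not part of LongParser.

```python
from urllib.request import urlopen


def docs_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of the Swagger UI page."""
    return f"http://{host}:{port}/docs"


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


if __name__ == "__main__":
    print("server up:", server_is_up(docs_url()))
```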

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | `.pdf` | OCR + table structure |
| Word | `.docx` | OMML equation injection |
| PowerPoint | `.pptx` | Slide-by-slide chunking |
| Excel | `.xlsx` | Sheet-aware table parsing |
| CSV | `.csv` | Column-profile chunks |
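
A simple pre-check against the extensions listed above can avoid handing an unsupported file to the pipeline. This is a sketch only: `is_supported` is a hypothetical helper, and matching extensions case-insensitively is an assumption, since how LongParser itself treats an uppercase extension such as `.PDF` is not stated here.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv"}


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("report.pdf", "slides.PPTX", "notes.txt"):
    print(name, is_supported(name))
```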