
Quickstart

Parse your first document in under 5 minutes.

1. Install

pip install longparser

2. Parse a PDF

from longparser import DocumentPipeline, ProcessingConfig

# Create pipeline with defaults
pipeline = DocumentPipeline(ProcessingConfig())

# Parse a PDF
doc = pipeline.process("research_paper.pdf")

print(f"Pages: {len(doc.pages)}")
print(f"Blocks: {len(doc.blocks)}")
print(f"Chunks: {len(doc.chunks)}")

3. Inspect Chunks

for chunk in doc.chunks[:3]:
    print(f"[{chunk.chunk_type}] tokens={chunk.token_count}")
    print(chunk.text[:200])
    print("---")
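
Beyond printing the first few chunks, it can help to see how chunks are distributed by type. The following is a minimal sketch that relies only on the `chunk_type` and `token_count` attributes used above. `StandInChunk`, `summarize_chunks`, and the sample type names `"text"` and `"table"` are hypothetical and exist only so the example runs without a parsed document; with a real document you would pass `doc.chunks` instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StandInChunk:
    """Hypothetical stand-in exposing the attributes used in step 3."""
    chunk_type: str
    token_count: int
    text: str


def summarize_chunks(chunks):
    """Return (chunk count per type, total tokens per type)."""
    counts = Counter()
    tokens = Counter()
    for chunk in chunks:
        counts[chunk.chunk_type] += 1
        tokens[chunk.chunk_type] += chunk.token_count
    return dict(counts), dict(tokens)


# Illustrative data only; the type names are assumptions, not library values.
sample = [
    StandInChunk("text", 120, "Introduction ..."),
    StandInChunk("table", 80, "| a | b |"),
    StandInChunk("text", 60, "Related work ..."),
]
counts, tokens = summarize_chunks(sample)
print(counts)  # {'text': 2, 'table': 1}
print(tokens)  # {'text': 180, 'table': 80}
```

With a parsed document, `summarize_chunks(doc.chunks)` should give the same kind of summary, assuming every chunk carries both attributes.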

4. Use with LangChain

from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
documents = loader.load()  # Returns List[Document]
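
To get a quick look at what the loader returned, the sketch below prints one summary line per document. It assumes the returned objects follow the standard LangChain `Document` shape (`page_content` and `metadata`); which metadata keys LongParserLoader actually sets is not stated here, so the code only lists whatever keys are present. `StandInDocument` and `describe_documents` are hypothetical helpers so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe_documents(documents, preview_chars=80):
    """Build one summary line per document: index, metadata keys, text preview."""
    lines = []
    for index, document in enumerate(documents):
        keys = sorted(document.metadata)
        preview = document.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{index}: metadata keys={keys} | {preview}")
    return lines


# Illustrative data only; the "source" key is an assumption.
sample = [StandInDocument("Hello\nworld", {"source": "report.pdf"})]
for line in describe_documents(sample):
    print(line)  # 0: metadata keys=['source'] | Hello world
```

In practice you would call `describe_documents(documents)` on the list returned by `loader.load()`.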

5. Use with LlamaIndex

from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
nodes = reader.load_data(file="report.pdf")

6. Start the REST Server

# Set environment variables
cp .env.example .env
# Edit .env with your keys

# Run server
uvicorn longparser.server.app:app --reload

Then visit http://localhost:8000/docs for the Swagger UI.
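
If you prefer to check the server from a script rather than a browser, a small probe like the one below may be enough. It is only a sketch: it requests the `/docs` page mentioned above using the Python standard library and reports whether the response was HTTP 200. The helper name `server_is_up` and the default timeout are assumptions made for this example.

```python
import urllib.request


def server_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET {base_url}/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_up())
```

A `False` result only means the page could not be fetched. It does not say why, so check the uvicorn log output if the server appears to be running.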

Supported Formats

Format      Extension  Notes
PDF         .pdf       OCR + table structure
Word        .docx      OMML equation injection
PowerPoint  .pptx      Slide-by-slide chunking
Excel       .xlsx      Sheet-aware table parsing
CSV         .csv       Column-profile chunks
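
When processing a folder of mixed files, it may be useful to filter by extension before handing paths to the pipeline. The sketch below uses only the extensions listed in the table above. `is_supported` and `split_by_support` are hypothetical helpers, and the case-insensitive match is an assumption; whether the library itself accepts upper-case extensions is not stated here.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv"}


def is_supported(path):
    """Return True if the file extension is in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Partition paths into (supported, skipped), preserving input order."""
    supported = [p for p in paths if is_supported(p)]
    skipped = [p for p in paths if not is_supported(p)]
    return supported, skipped


supported, skipped = split_by_support(["report.pdf", "notes.txt", "Budget.XLSX"])
print(supported)  # ['report.pdf', 'Budget.XLSX']
print(skipped)    # ['notes.txt']
```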