Chunkers Reference

HybridChunker

The main chunking engine. It combines 6 strategies to produce RAG-optimized chunks.

from longparser.chunkers import HybridChunker
from longparser.schemas import ChunkingConfig

chunker = HybridChunker(config=ChunkingConfig())
chunks = chunker.chunk(blocks)  # blocks: a list of Block objects, e.g. doc.blocks

Constructor

HybridChunker(config: ChunkingConfig | None = None)

If config is None, a default ChunkingConfig() is used.

Methods

`chunk(blocks)`

Run the full 6-strategy pipeline on a list of Block objects and return a list of Chunk objects.

chunks: list[Chunk] = chunker.chunk(doc.blocks)

Steps:

Strategy 0 — Autonomous equation detection pre-pass
Filter — Remove header/footer blocks and separator-only blocks
Strategy 1 — Group blocks by hierarchy_path (sections)
Per section:
Strategy 4 — Table-aware chunking (schema + data chunks)
Strategy 5 — List-aware chunking (group bullet lists)
Strategy 3 — Token-window packing for remaining blocks
Merge small chunks below min_chunk_tokens
Overlap — Add token overlap between consecutive chunks
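
The grouping, packing, merge, and overlap steps above can be sketched in plain Python. This is a minimal illustration, not LongParser's implementation: the `Block` stand-in, the whitespace token counter, and every helper name (`group_by_section`, `pack`, `merge_small`, `add_overlap`, `chunk_sketch`) are hypothetical. It also assumes that merging and overlap are applied within each section, which the step list does not state, and it skips the equation, filter, table, and list steps.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block (not the LongParser class)."""
    text: str
    hierarchy_path: tuple = ()


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer output.
    return len(text.split())


def group_by_section(blocks):
    # Group blocks that share a hierarchy_path, keeping first-seen order.
    sections = {}
    for block in blocks:
        sections.setdefault(block.hierarchy_path, []).append(block)
    return list(sections.values())


def pack(texts, max_tokens):
    # Greedy token-window packing: start a new chunk when the next block
    # would push the current one past max_tokens. A single oversized block
    # is kept whole in this sketch rather than split.
    chunks, current, used = [], [], 0
    for text in texts:
        n = count_tokens(text)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def merge_small(chunks, min_chunk_tokens):
    # Fold any chunk below min_chunk_tokens into the chunk before it.
    merged = []
    for chunk in chunks:
        if merged and count_tokens(chunk) < min_chunk_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


def add_overlap(chunks, overlap_tokens):
    # Prefix each chunk (except the first) with the tail of its predecessor.
    if overlap_tokens <= 0:
        return list(chunks)
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
        else:
            tail = chunks[i - 1].split()[-overlap_tokens:]
            out.append(" ".join(tail + chunk.split()))
    return out


def chunk_sketch(blocks, max_tokens=512, overlap_tokens=64, min_chunk_tokens=20):
    result = []
    for section in group_by_section(blocks):
        chunks = pack([b.text for b in section], max_tokens)
        chunks = merge_small(chunks, min_chunk_tokens)
        result.extend(add_overlap(chunks, overlap_tokens))
    return result
```

With small limits the effect of each parameter is easy to see: `chunk_sketch(blocks, max_tokens=4, overlap_tokens=1, min_chunk_tokens=2)` packs at most four tokens per chunk before overlap is added, then repeats one token from the previous chunk at the start of the next.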

Configuration Reference

| Config | Default | Description |
|---|---|---|
| `max_tokens` | `512` | Hard ceiling on tokens per chunk |
| `overlap_tokens` | `64` | Token overlap between consecutive chunks |
| `detect_equations` | `True` | Run the equation detection pre-pass |
| `exclude_headers_footers` | `True` | Remove page header/footer blocks |
| `generate_schema_chunks` | `True` | Add a schema chunk per table |
| `table_chunk_format` | `"row_record"` | Format for table data chunks: `pipe` or `row_record` |
| `wide_table_col_threshold` | `15` | Split a table's columns into bands when it has more columns than this |
| `min_chunk_tokens` | `20` | Merge chunks smaller than this |
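
To make the three table-related options concrete, here is a rough sketch of how they could interact. Everything in it is an assumption for illustration: `TableOptions` is a hypothetical stand-in for the corresponding `ChunkingConfig` fields, the exact text produced for `pipe` and `row_record` rows is guessed from the option names, and the band width is assumed to equal `wide_table_col_threshold`. LongParser's actual output may differ.

```python
from dataclasses import dataclass


@dataclass
class TableOptions:
    """Hypothetical stand-in mirroring the table-related defaults above."""
    generate_schema_chunks: bool = True
    table_chunk_format: str = "row_record"
    wide_table_col_threshold: int = 15


def column_bands(headers, threshold):
    # Tables at or under the threshold stay in one band; wider tables are
    # split into consecutive bands of at most `threshold` columns (assumed).
    if len(headers) <= threshold:
        return [list(range(len(headers)))]
    return [
        list(range(start, min(start + threshold, len(headers))))
        for start in range(0, len(headers), threshold)
    ]


def render_row(headers, row, fmt):
    if fmt == "pipe":
        # Assumed: a pipe-delimited row, as in a Markdown table body.
        return "| " + " | ".join(row) + " |"
    if fmt == "row_record":
        # Assumed: each row as self-describing "column: value" pairs.
        return "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
    raise ValueError(f"unknown table_chunk_format: {fmt!r}")


def table_chunks(headers, rows, opts):
    chunks = []
    for band in column_bands(headers, opts.wide_table_col_threshold):
        band_headers = [headers[i] for i in band]
        if opts.generate_schema_chunks:
            # One schema chunk per band, listing the column names.
            chunks.append("columns: " + ", ".join(band_headers))
        for row in rows:
            cells = [row[i] for i in band]
            chunks.append(render_row(band_headers, cells, opts.table_chunk_format))
    return chunks
```

Under these assumptions, a `row_record` chunk carries its column names with it, so a retrieved row is readable without the table header, while `pipe` is more compact but relies on the schema chunk for context.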