Hybrid Chunking¶
LongParser's HybridChunker combines 6 strategies to produce RAG-optimized chunks that respect document structure.
The 6 Strategies¶
| # | Strategy | What it does |
|---|---|---|
| 0 | Equation detection | Pre-pass re-tags missed formula blocks |
| 1 | Structure-aware | Groups blocks by hierarchy path (sections) |
| 2 | Layout-aware | Classifies blocks by type before chunking |
| 3 | Token-window packing | Packs blocks up to max_tokens with overlap |
| 4 | Table-aware | Schema chunk + row-batched table chunks |
| 5 | List-aware | Keeps bullet groups together as single chunks |
Usage¶
from longparser.chunkers import HybridChunker
from longparser.schemas import ChunkingConfig
config = ChunkingConfig(
max_tokens=512,
overlap_tokens=64,
detect_equations=True,
table_chunk_format="row_record", # or "pipe"
generate_schema_chunks=True,
)
chunker = HybridChunker(config)
chunks = chunker.chunk(doc.blocks)
Table Chunking¶
Tables are chunked with: - Schema chunk — column names, types, null rates, sample rows - Data chunks — row-batched with header repeated per chunk - Wide-table banding — columns split into bands for very wide tables
# Table chunk formats
ChunkingConfig(table_chunk_format="row_record")
# Row 1: col_a=val; col_b=val; col_c=val
ChunkingConfig(table_chunk_format="pipe")
# col_a | col_b | col_c
# val_1 | val_2 | val_3
Chunk Structure¶
@dataclass
class Chunk:
chunk_id: str # UUID
text: str # Chunk content
token_count: int # Approximate token count
chunk_type: str # section | table | table_schema | list | equation
section_path: list[str] # ["Introduction", "Methods"]
page_numbers: list[int] # [1, 2]
block_ids: list[str] # Source block IDs for traceability
metadata: dict # chunk_type-specific metadata
Token Budget¶
Chunks respect a hard max_tokens ceiling. Equations are kept with their surrounding context using a glue heuristic:
- If the next block is an equation AND the current window overflows, the last paragraph carries over into the new chunk (so the equation is never split from its context).