
Hybrid Chunking

LongParser's HybridChunker combines 6 strategies to produce RAG-optimized chunks that respect document structure.

The 6 Strategies

| # | Strategy             | What it does                                   |
|---|----------------------|------------------------------------------------|
| 0 | Equation detection   | Pre-pass re-tags missed formula blocks         |
| 1 | Structure-aware      | Groups blocks by hierarchy path (sections)     |
| 2 | Layout-aware         | Classifies blocks by type before chunking      |
| 3 | Token-window packing | Packs blocks up to max_tokens with overlap     |
| 4 | Table-aware          | Schema chunk + row-batched table chunks        |
| 5 | List-aware           | Keeps bullet groups together as single chunks  |
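
To make the structure-aware strategy (row 1) concrete, here is a minimal sketch that groups consecutive blocks sharing the same hierarchy path. The `Block` type and the `group_by_section` helper are hypothetical stand-ins for illustration only, not LongParser's actual classes or internals.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block (not LongParser's real type)."""
    text: str
    section_path: tuple[str, ...]


def group_by_section(blocks):
    """Group consecutive blocks that share the same hierarchy path."""
    return [
        (path, [b.text for b in run])
        for path, run in groupby(blocks, key=lambda b: b.section_path)
    ]


blocks = [
    Block("Intro paragraph 1", ("Introduction",)),
    Block("Intro paragraph 2", ("Introduction",)),
    Block("Method details", ("Methods", "Setup")),
]
groups = group_by_section(blocks)
```

Because `groupby` only merges adjacent items, document order is preserved and a section that reappears later would form a new group rather than being merged backwards.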

Usage

from longparser.chunkers import HybridChunker
from longparser.schemas import ChunkingConfig

config = ChunkingConfig(
    max_tokens=512,
    overlap_tokens=64,
    detect_equations=True,
    table_chunk_format="row_record",  # or "pipe"
    generate_schema_chunks=True,
)

chunker = HybridChunker(config)
# `doc` is assumed to be a document already parsed by LongParser
chunks = chunker.chunk(doc.blocks)

Table Chunking

Tables are chunked into:

  • Schema chunk: column names, types, null rates, sample rows
  • Data chunks: row-batched, with the header repeated in each chunk
  • Wide-table banding: columns split into bands for very wide tables
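
The row batching and column banding described above can be sketched roughly as follows. Both helpers (`batch_rows`, `column_bands`) are hypothetical, and the fixed `rows_per_chunk` and `band_width` parameters are assumptions made for illustration; the real chunker presumably sizes batches against the token budget instead.

```python
def batch_rows(header, rows, rows_per_chunk):
    """Split table rows into batches, repeating the header in every batch."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be >= 1")
    return [
        [header] + rows[i:i + rows_per_chunk]
        for i in range(0, len(rows), rows_per_chunk)
    ]


def column_bands(n_columns, band_width):
    """Split column indices into consecutive bands for very wide tables."""
    if band_width < 1:
        raise ValueError("band_width must be >= 1")
    return [
        list(range(start, min(start + band_width, n_columns)))
        for start in range(0, n_columns, band_width)
    ]


header = ["id", "name"]
rows = [[1, "a"], [2, "b"], [3, "c"]]
batches = batch_rows(header, rows, rows_per_chunk=2)
bands = column_bands(n_columns=5, band_width=2)
```

Repeating the header in each batch means every data chunk can be read on its own at retrieval time, without needing the chunk that precedes it.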

# Table chunk formats
ChunkingConfig(table_chunk_format="row_record")
# Row 1: col_a=val; col_b=val; col_c=val

ChunkingConfig(table_chunk_format="pipe")
# col_a | col_b | col_c
# val_1 | val_2 | val_3
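
For readers who want to see how the two formats differ on the same data, the sketch below renders one table both ways. The functions `to_row_record` and `to_pipe` are hypothetical and only mimic the output shapes shown in the comments above; they are not the library's renderer.

```python
def to_row_record(header, rows):
    """Render each row as 'Row N: col=val; col=val' (one line per row)."""
    return "\n".join(
        f"Row {i}: " + "; ".join(f"{col}={val}" for col, val in zip(header, row))
        for i, row in enumerate(rows, start=1)
    )


def to_pipe(header, rows):
    """Render a header line followed by pipe-separated value lines."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)


header = ["col_a", "col_b", "col_c"]
rows = [["val_1", "val_2", "val_3"]]
record_text = to_row_record(header, rows)
pipe_text = to_pipe(header, rows)
```

The row-record form repeats column names on every row, which costs tokens but keeps each row self-describing; the pipe form is more compact but relies on the header line staying in the same chunk.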

Chunk Structure

from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str           # UUID
    text: str               # Chunk content
    token_count: int        # Approximate token count
    chunk_type: str         # section | table | table_schema | list | equation
    section_path: list[str] # ["Introduction", "Methods"]
    page_numbers: list[int] # [1, 2]
    block_ids: list[str]    # Source block IDs for traceability
    metadata: dict          # chunk_type-specific metadata
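
As a usage illustration, the sketch below filters chunks by `chunk_type` and builds a human-readable source label from `section_path` and `page_numbers`. The `Chunk` class here is a local copy of the fields listed above so the example runs standalone, and `describe_source` is a hypothetical helper, not part of LongParser.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Local copy of the documented fields, so the sketch runs standalone."""
    chunk_id: str
    text: str
    token_count: int
    chunk_type: str
    section_path: list[str]
    page_numbers: list[int]
    block_ids: list[str]
    metadata: dict = field(default_factory=dict)


def describe_source(chunk):
    """Build a citation-style label from the chunk's section path and pages."""
    section = " > ".join(chunk.section_path) or "(no section)"
    pages = ", ".join(str(p) for p in chunk.page_numbers)
    return f"{section} (pages {pages})"


chunks = [
    Chunk("c1", "We measured...", 4, "section",
          ["Introduction", "Methods"], [1, 2], ["b1", "b2"]),
    Chunk("c2", "col_a | col_b", 5, "table", ["Results"], [3], ["b3"]),
]
table_chunks = [c for c in chunks if c.chunk_type == "table"]
label = describe_source(chunks[0])
```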

Token Budget

Chunks respect a hard max_tokens ceiling. Equations are kept with their surrounding context using a glue heuristic:

  • If the next block is an equation and adding it would overflow the current window, the last paragraph is carried over into the new chunk, so the equation is never split from its context.
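
One way the glue heuristic could work is sketched below. This is an assumption-laden illustration, not LongParser's implementation: `pack_with_glue` is a hypothetical function, blocks are simplified to `(kind, text, tokens)` tuples, overlap is omitted, and a single block larger than `max_tokens` is not split.

```python
def pack_with_glue(blocks, max_tokens):
    """Pack (kind, text, tokens) blocks into chunks of at most max_tokens.

    When an equation would overflow the current chunk, the preceding
    paragraph moves into the new chunk along with it.
    """
    chunks, current, used = [], [], 0
    for kind, text, tokens in blocks:
        if current and used + tokens > max_tokens:
            carried = []
            # Glue: carry the last paragraph over only if the current chunk
            # keeps at least one block and the pair still fits the ceiling.
            if (
                kind == "equation"
                and len(current) > 1
                and current[-1][0] == "paragraph"
                and current[-1][2] + tokens <= max_tokens
            ):
                carried = [current.pop()]
            chunks.append(current)
            current = carried
            used = sum(t for _, _, t in current)
        current.append((kind, text, tokens))
        used += tokens
    if current:
        chunks.append(current)
    return [[text for _, text, _ in chunk] for chunk in chunks]


blocks = [
    ("paragraph", "P1", 40),
    ("paragraph", "P2", 30),
    ("equation", "E1", 20),
    ("paragraph", "P3", 10),
]
packed = pack_with_glue(blocks, max_tokens=80)
```

In this example the equation `E1` would push the first chunk to 90 tokens, so `P2` leaves the first chunk and opens the second one together with `E1`.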