Hybrid Chunking¶

LongParser's HybridChunker combines 6 strategies to produce RAG-optimized chunks that respect document structure.

The 6 Strategies¶

#	Strategy	What it does
0	Equation detection	Pre-pass re-tags missed formula blocks
1	Structure-aware	Groups blocks by hierarchy path (sections)
2	Layout-aware	Classifies blocks by type before chunking
3	Token-window packing	Packs blocks up to `max_tokens` with overlap
4	Table-aware	Schema chunk + row-batched table chunks
5	List-aware	Keeps bullet groups together as single chunks

Usage¶

from longparser.chunkers import HybridChunker
from longparser.schemas import ChunkingConfig

config = ChunkingConfig(
    max_tokens=512,
    overlap_tokens=64,
    detect_equations=True,
    table_chunk_format="row_record",  # or "pipe"
    generate_schema_chunks=True,
    use_semantic_chunking=True,       # Split on semantic topic shifts
)

chunker = HybridChunker(config)
chunks = chunker.chunk(doc.blocks)

Table Chunking¶

Tables are chunked with: - Schema chunk — column names, types, null rates, sample rows - Data chunks — row-batched with header repeated per chunk - Wide-table banding — columns split into bands for very wide tables

# Table chunk formats
ChunkingConfig(table_chunk_format="row_record")
# Row 1: col_a=val; col_b=val; col_c=val

ChunkingConfig(table_chunk_format="pipe")
# col_a | col_b | col_c
# val_1 | val_2 | val_3

Chunk Structure¶

@dataclass
class Chunk:
    chunk_id: str           # UUID
    text: str               # Chunk content
    token_count: int        # Approximate token count
    chunk_type: str         # section | table | table_schema | list | equation
    section_path: list[str] # ["Introduction", "Methods"]
    page_numbers: list[int] # [1, 2]
    block_ids: list[str]    # Source block IDs for traceability
    metadata: dict          # chunk_type-specific metadata

Token Budget¶

Chunks respect a hard max_tokens ceiling. Equations are kept with their surrounding context using a glue heuristic:

If the next block is an equation AND the current window overflows, the last paragraph carries over into the new chunk (so the equation is never split from its context).

Semantic Chunking¶

When use_semantic_chunking=True, the chunker uses embedding similarity (default: all-MiniLM-L6-v2) to detect topic shifts within a section. Instead of splitting purely by token count, it finds natural breakpoints where the semantic content changes.

config = ChunkingConfig(
    use_semantic_chunking=True,
    semantic_threshold=0.3,              # Lower = more splits
    semantic_model="all-MiniLM-L6-v2",   # or "all-mpnet-base-v2"
)

The model is lazily loaded on first use — no memory cost if the feature is disabled.

Cross-Reference Resolution¶

When resolve_cross_references=True (default), the pipeline automatically links textual references to their target blocks:

Explicit references: "see Figure 3", "Table 2", "Section 3.1", "Appendix A" → linked via regex + dictionary lookup.
Implicit references: "the figure above", "the table below" → linked via spatial proximity in reading order.

Resolved links appear in chunk metadata:

{
  "cross_references": [
    {"label": "Figure 3", "target_block_id": "block-uuid-123"},
    {"label": "the table below", "target_block_id": "block-uuid-456", "resolution": "proximity"}
  ]
}

Quality Scoring¶

Each chunk receives a quality_score (0.0–1.0) based on:

Block confidence — OCR confidence from the extraction engine
Dictionary word coverage — percentage of words found in /usr/share/dict/words (penalizes garbled OCR)
Language ID confidence — fastText-based language detection score (low confidence = noise)

from longparser.chunkers.quality_scorer import score_chunks
scored = score_chunks(chunks, blocks)
print(scored[0].quality_score)  # 0.92

PII Redaction¶

When redact_pii=True in ProcessingConfig, the pipeline automatically masks sensitive data before any HITL review:

Pass 1 (always): Fast regex + Luhn checksum for Emails, Phones, SSNs, Credit Cards, IPs.
Pass 2 (optional): spaCy NER (use_ner_redaction=True) for names, organizations, and locations.

Original values are preserved in block.pii_redactions for authorized recovery.