How It Works
LongTracer uses a two-stage hybrid pipeline to verify each claim in an LLM response against source documents.
Pipeline Overview
LLM Response
         │
         ▼
┌─────────────────┐
│  Claim Splitter │  Split response into individual sentences/claims
└────────┬────────┘
         │  ["claim 1", "claim 2", ...]
         ▼
┌─────────────────────────────────────────────────────┐
│  For each claim:                                    │
│                                                     │
│  Step A: STS (Bi-Encoder)                           │
│  ┌──────────────────────────────────────────────┐   │
│  │ all-MiniLM-L6-v2                             │   │
│  │ Encode claim + all source sentences          │   │
│  │ Cosine similarity → best matching source     │   │
│  │ O(N+M) - fast, ~10ms                         │   │
│  └──────────────────┬───────────────────────────┘   │
│                     │ STS score ≥ 0.25?             │
│                     ▼                               │
│  Step B: NLI (Cross-Encoder) [gated]                │
│  ┌──────────────────────────────────────────────┐   │
│  │ nli-deberta-v3-xsmall                        │   │
│  │ (claim, best_source) → [contra, neutral, ent]│   │
│  │ O(1) per claim - accurate, ~150ms            │   │
│  └──────────────────┬───────────────────────────┘   │
│                     │                               │
│                     ▼                               │
│  Verdict: supported / hallucination / neutral       │
└─────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐
│   Trust Score   │  supported_claims / total_claims
└─────────────────┘
Stage 1: Claim Splitting
The LLM response is split into individual sentences using a regex-based splitter that:
- Protects decimal numbers (`98.5` is not split at the `.`)
- Protects abbreviations (`Dr.`, `Inc.`, `e.g.`) from triggering splits
- Filters out very short fragments (< 15 chars)
- Detects meta-statements — honest uncertainty phrases like "the documents do not contain..." (never flagged as hallucinations)
- Detects hallucination patterns — outside-knowledge signals like "based on my knowledge..." (flagged regardless of NLI)
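The splitting rules above can be sketched with a placeholder trick: periods that must not end a sentence are temporarily swapped for a sentinel character, the text is split, and the periods are restored. This is a minimal illustration, not LongTracer's actual splitter; the names `split_claims`, `ABBREVIATIONS`, and `MIN_CLAIM_CHARS` are hypothetical, and the abbreviation list is an assumption seeded with the examples above.

```python
import re

# Hypothetical abbreviation list; the real splitter's list is not shown here.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Inc.", "e.g.", "i.e.")
MIN_CLAIM_CHARS = 15  # fragments shorter than this are dropped
_DOT = "\x00"         # sentinel for periods that must not end a sentence


def split_claims(text: str) -> list[str]:
    # Protect decimal points: a period between two digits is not a boundary.
    protected = re.sub(r"(?<=\d)\.(?=\d)", _DOT, text)
    # Protect the periods inside known abbreviations.
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", _DOT))
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Restore protected periods, then drop very short fragments.
    claims = [p.replace(_DOT, ".").strip() for p in parts]
    return [c for c in claims if len(c) >= MIN_CLAIM_CHARS]


if __name__ == "__main__":
    sample = ("Dr. Smith reported an accuracy of 98.5 percent. Yes. "
              "The model was trained by Acme Inc. in 2020.")
    for claim in split_claims(sample):
        print(claim)
```

One limitation of this sketch: an abbreviation that genuinely ends a sentence will not trigger a split, so the two sentences around it stay merged.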
Stage 2A: STS Evidence Selection
For each claim, the bi-encoder (all-MiniLM-L6-v2) computes cosine similarity between the claim embedding and every source sentence embedding.
- Complexity: O(N + M) where N = claim tokens, M = total source tokens
- Typical latency: < 10ms per claim
- Output: best-matching source sentence + similarity score
Gating: If the best STS score is below 0.25, NLI is skipped entirely. This avoids wasting compute on claims that have no plausible source match.
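A minimal sketch of evidence selection and gating, assuming the claim and source sentences have already been embedded by the bi-encoder. Plain Python lists stand in for the embeddings so the example runs without any model; `select_evidence` and `STS_GATE` are hypothetical names, not LongTracer's API.

```python
import math

STS_GATE = 0.25  # claims whose best score falls below this skip NLI


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_evidence(claim_vec, source_vecs):
    """Return (best_index, best_score, run_nli) for one claim."""
    if not source_vecs:
        return None, 0.0, False  # no sources: nothing to match, NLI skipped
    scores = [cosine(claim_vec, s) for s in source_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_index]
    return best_index, best_score, best_score >= STS_GATE


if __name__ == "__main__":
    claim = [1.0, 0.0, 0.0]
    sources = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
    print(select_evidence(claim, sources))
```

Only the single best-matching sentence is passed on, which is what keeps the more expensive NLI step at one pair per claim.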
Stage 2B: NLI Verification
The cross-encoder (nli-deberta-v3-xsmall) takes the (claim, best_source_sentence) pair and outputs three scores:
| Label | Meaning |
|---|---|
| `entailment` | Source supports the claim |
| `neutral` | Source neither supports nor contradicts |
| `contradiction` | Source contradicts the claim → hallucination |
- Complexity: O(1) per claim (single pair)
- Typical latency: ~150ms per claim
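To make the three scores concrete, the sketch below turns raw cross-encoder outputs into per-label probabilities with a softmax. It assumes the model returns unnormalized logits, and it uses the label order shown in the pipeline diagram (contradiction, neutral, entailment); both are assumptions for illustration, so check the model's own label mapping before relying on the order. `nli_scores` is a hypothetical helper name.

```python
import math

# Label order as listed in the pipeline diagram; an assumption for this sketch.
LABELS = ("contradiction", "neutral", "entailment")


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nli_scores(logits):
    """Map three raw scores to a {label: probability} dict."""
    return dict(zip(LABELS, softmax(logits)))


if __name__ == "__main__":
    scores = nli_scores([-2.0, 0.5, 3.0])
    print(max(scores, key=scores.get), scores)
```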
Hallucination Detection Logic
A claim is flagged as a hallucination if any of these conditions are true:
- `contradiction_score > 0.5` (NLI says it's contradicted)
- Low STS score + hallucination pattern detected + NLI didn't rescue it
- Claim contains explicit outside-knowledge signals (`"based on my knowledge..."`)
Meta-statements are never flagged as hallucinations regardless of scores.
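The rules above can be read as a small decision function. The sketch below is one possible reading, not the library's implementation: it interprets "low STS score" as falling below the 0.25 NLI gate and "NLI didn't rescue it" as the entailment score not exceeding a threshold. `ClaimEvidence`, `is_hallucination`, and `ENTAILMENT_THRESHOLD` (set to 0.5 here) are hypothetical.

```python
from dataclasses import dataclass

CONTRADICTION_THRESHOLD = 0.5  # from the first condition above
STS_GATE = 0.25                # assumption: "low STS" means below the NLI gate
ENTAILMENT_THRESHOLD = 0.5     # hypothetical value; the text does not give one


@dataclass
class ClaimEvidence:
    sts_score: float
    entailment: float      # 0.0 when NLI was skipped by the gate
    contradiction: float   # 0.0 when NLI was skipped by the gate
    is_meta_statement: bool = False
    has_hallucination_pattern: bool = False
    has_outside_knowledge_signal: bool = False


def is_hallucination(c: ClaimEvidence) -> bool:
    # Meta-statements are never flagged, regardless of scores.
    if c.is_meta_statement:
        return False
    # Condition 1: NLI says the source contradicts the claim.
    if c.contradiction > CONTRADICTION_THRESHOLD:
        return True
    # Condition 2: weak source match, suspicious wording, no NLI rescue.
    nli_rescued = c.entailment > ENTAILMENT_THRESHOLD
    if c.sts_score < STS_GATE and c.has_hallucination_pattern and not nli_rescued:
        return True
    # Condition 3: explicit outside-knowledge signal in the claim.
    return c.has_outside_knowledge_signal


if __name__ == "__main__":
    print(is_hallucination(ClaimEvidence(0.8, 0.1, 0.9)))   # contradicted
    print(is_hallucination(ClaimEvidence(0.8, 0.9, 0.05)))  # supported
```

Checking the meta-statement flag first is what guarantees that honest uncertainty is never penalized, whatever the scores say.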
Trust Score
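The trust score is the fraction of claims that are supported: `trust_score = supported_claims / total_claims`, where `supported_claims` is the number of claims with `entailment_score > threshold` and no contradiction.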
- `1.0` = all claims supported
- `0.0` = no claims supported (or no sources provided)
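As a worked example of the ratio, the sketch below computes the score from a list of per-claim verdicts. The function name `trust_score` and the `has_sources` flag are hypothetical, and returning `0.0` for an empty claim list is an assumption made here to avoid dividing by zero.

```python
def trust_score(verdicts, has_sources=True):
    """Fraction of claims whose verdict is 'supported'."""
    # No sources (or, by assumption, no claims) yields the minimum score.
    if not has_sources or not verdicts:
        return 0.0
    supported = sum(1 for v in verdicts if v == "supported")
    return supported / len(verdicts)


if __name__ == "__main__":
    verdicts = ["supported", "hallucination", "supported", "neutral"]
    print(trust_score(verdicts))  # 2 of 4 claims supported
```

Note that `neutral` claims lower the score just as hallucinations do, because only supported claims count toward the numerator.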
Parallel Pipeline
When using ParallelPipeline, context relevance scoring runs in parallel with LLM generation:
      Retrieve docs
            │
            ├──────────────────────────────┐
            │                              │
            ▼                              ▼
Context Relevance Scoring           LLM Generation
(bi-encoder cosine sim)             (your LLM call)
            │                              │
            └──────────────┬───────────────┘
                           │
                           ▼
               Batch Claim Verification
                           │
                           ▼
                    Verdict + Flags
Because relevance scoring runs while the LLM is thinking, it adds no latency to the pipeline as long as it finishes before generation does.
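The overlap can be illustrated with a thread pool from the standard library. This is a sketch of the general pattern, not the `ParallelPipeline` implementation: `score_relevance` and `call_llm` are hypothetical stand-ins that sleep to simulate work, and `run_parallel` is an illustrative name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def score_relevance(docs):
    # Stand-in for bi-encoder context relevance scoring.
    time.sleep(0.01)
    return [1.0 for _ in docs]


def call_llm(prompt):
    # Stand-in for your LLM call, which usually dominates latency.
    time.sleep(0.05)
    return "generated answer"


def run_parallel(docs, prompt):
    """Start both tasks at once and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        relevance_future = pool.submit(score_relevance, docs)
        answer_future = pool.submit(call_llm, prompt)
        return relevance_future.result(), answer_future.result()


if __name__ == "__main__":
    relevance, answer = run_parallel(["doc a", "doc b"], "What is X?")
    print(relevance, answer)
```

Total wall-clock time is roughly that of the slower task rather than the sum of both, which is why the faster relevance step is effectively hidden behind generation.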