Complete Workflow¶
Full RAG regression testing workflow from auto-capture to regression detection.
Overview¶
This demo walks through the complete LongProbe workflow in action. It is a good fit for:
- ✅ Understanding how LongProbe works end-to-end
- ✅ First-time users learning the tool
- ✅ Seeing all features in one comprehensive demo
- ✅ Understanding the regression detection process
What It Shows¶
1. Test Questions¶
Shows the questions being tested upfront:
- Sample questions from your golden set
- Clear visibility into what's being validated
- Understanding the scope of testing
2. Workflow Progress (Animated)¶
A single, continuously updating progress panel showing:
- Step 1: Start Mock RAG API - Simulates your RAG backend
- Step 2: Build Golden Set - Auto-capture questions and expected chunks
- Step 3: Run Initial Check - Test retrieval quality (100% pass)
- Step 4: Save Baseline - Store results for future comparison
- Step 5: Break API - Simulate a regression (remove a document)
- Step 6: Detect Regression - Catch the issue automatically
3. Retrieval Example¶
Shows what the API returns:
- Actual question being tested
- Retrieved chunks with similarity scores
- Understanding how retrieval works
4. Regression Detection¶
Shows exactly what broke:
- Which question failed
- What chunk went missing
- Clear explanation of the regression
5. Results Table¶
Provides before/after comparison:
- Overall recall (100% → 86.7%)
- Pass rate (100% → 60%)
- Number of failed tests (0 → 2)
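The figures above are consistent with, for example, five golden questions with three required chunks each, where two questions each lose one chunk after the break. That split is an assumption made for illustration (the demo's exact question count is not stated here), as is the rule that a question passes only when all of its required chunks are found. A minimal sketch of the arithmetic:

```python
# Hypothetical per-question results: (chunks_found, chunks_required).
# Assumes 5 questions x 3 required chunks; two questions each lose one chunk.
before = [(3, 3)] * 5
after = [(3, 3), (3, 3), (3, 3), (2, 3), (2, 3)]


def summarize(results):
    """Return (overall_recall, pass_rate, failed_count) for one run."""
    found = sum(f for f, _ in results)
    required = sum(r for _, r in results)
    # Assumption: a question fails if any required chunk is missing.
    failed = sum(1 for f, r in results if f < r)
    return found / required, (len(results) - failed) / len(results), failed


for name, results in (("before", before), ("after", after)):
    recall, pass_rate, failed = summarize(results)
    print(f"{name}: recall={recall:.1%} pass_rate={pass_rate:.0%} failed={failed}")
```

Under these assumptions, 13 of 15 chunks are found after the break, which gives the 86.7% recall and 60% pass rate shown in the table.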
Use Case¶
Scenario: You're building a RAG application and want to ensure changes don't break retrieval quality.
Workflow:
- Capture baseline - Auto-generate golden questions from your RAG API
- Run tests - Verify everything works (100% pass)
- Save baseline - Store the perfect state
- Make changes - Refactor, upgrade, or modify your RAG pipeline
- Detect regressions - LongProbe catches issues automatically
- Fix and verify - Address issues and re-run tests
Code Example¶
The demo script shows the complete workflow:
```python
from longprobe import LongProbe
from longprobe.adapters import HttpAdapter
from longprobe.core.golden import GoldenSet, GoldenQuestion

# http_config and test_questions are defined elsewhere in the demo script;
# test_questions is an iterable of (question_text, tags) pairs.

# 1. Configure HTTP adapter for your RAG API
adapter = HttpAdapter(config=http_config)

# 2. Auto-capture golden questions
golden_questions = []
for question_text, tags in test_questions:
    results = adapter.retrieve(query=question_text, top_k=3)
    # Whatever the API returns now becomes the expected result
    required_chunks = [r["text"] for r in results]
    golden_questions.append(GoldenQuestion(
        question=question_text,
        required_chunks=required_chunks,
        tags=tags
    ))
golden_set = GoldenSet(questions=golden_questions)
golden_set.to_yaml("goldens.yaml")

# 3. Run initial check
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")
report = probe.run()
print(f"Recall: {report.overall_recall:.1%}")

# 4. Save baseline
probe.save_baseline(label="latest")

# 5. After making changes...
report_after = probe.run()

# 6. Detect regression
diff = probe.diff(baseline_label="latest")
if diff["regressions"]:
    print(f"⚠️ {len(diff['regressions'])} regressions detected!")
```
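In a CI job, the result of the diff step could be turned into a process exit code so that a regression fails the build. The sketch below relies only on what the example shows, namely that the diff result is dict-like with a `"regressions"` key whose value has a length. The `gate` helper is hypothetical and not part of LongProbe, and the stand-in dicts replace a real `probe.diff(...)` call.

```python
import sys


def gate(diff):
    """Return a process exit code: 1 if any regressions are present, else 0."""
    # Assumption: diff behaves like a dict with a "regressions" sequence.
    regressions = diff.get("regressions") or []
    if regressions:
        print(f"{len(regressions)} regressions detected", file=sys.stderr)
        return 1
    return 0


# Stand-in dicts; in the real workflow this would be the probe's diff result.
print(gate({"regressions": []}))
print(gate({"regressions": ["placeholder entry"]}))
```

A script would typically end with `sys.exit(gate(diff))` so the CI runner sees the failure.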
Key Features Demonstrated¶
- 🎯 Auto-capture - Generate golden questions from your API
- 📊 Live progress - Single animated workflow panel
- 🔍 Retrieval visibility - See what chunks are retrieved
- 💾 Baseline tracking - Save and compare results
- 🚨 Regression detection - Catch issues automatically
- 📈 Before/after comparison - Clear metrics on what changed
Workflow Steps Explained¶
Step 1: Start Mock RAG API¶
Simulates your RAG backend with a simple HTTP server. In production, this would be your actual RAG API (LongTrainer, LangServe, etc.).
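For orientation, a mock backend of this kind can be written with the Python standard library alone. The sketch below is not the demo's actual server: the corpus, the word-overlap score, and the JSON request and response shapes (`query`, `top_k`, `results`) are all assumptions chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy corpus; placeholder documents, not the demo's data.
DOCS = [
    "Refunds are processed within five business days.",
    "Passwords can be reset from the account settings page.",
    "Support is available by email on weekdays.",
]


def retrieve(query, top_k=3, docs=DOCS):
    """Rank docs by word overlap with the query (a crude similarity score)."""
    q = set(query.lower().split())
    scored = []
    for text in docs:
        words = set(text.lower().rstrip(".").split())
        union = q | words
        score = len(q & words) / len(union) if union else 0.0
        scored.append({"text": text, "score": round(score, 3)})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


class RetrieveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed request body: {"query": "...", "top_k": 3}
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        results = retrieve(body.get("query", ""), int(body.get("top_k", 3)))
        payload = json.dumps({"results": results}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Block and serve the mock API until interrupted."""
    HTTPServer(("127.0.0.1", port), RetrieveHandler).serve_forever()
```

Calling `serve()` starts the server on port 8000. Passing a shortened corpus, such as `retrieve(query, docs=DOCS[1:])`, mimics the "remove a document" break used later in the workflow.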
Step 2: Build Golden Set¶
Auto-captures golden questions by querying your API and saving the retrieved chunks as expected results. This is the "capture" workflow.
Step 3: Run Initial Check¶
Tests all questions against your RAG API and verifies 100% recall. This establishes your baseline.
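Conceptually, the check for one question compares the chunks the golden set requires with the chunks the API returns. The sketch below assumes exact string matching between required and retrieved chunk text; LongProbe's actual matching rules may differ, and both function names are hypothetical.

```python
def question_recall(required_chunks, retrieved_texts):
    """Fraction of required chunks that appear verbatim in the retrieved texts."""
    if not required_chunks:
        return 1.0
    retrieved = set(retrieved_texts)
    hits = [c for c in required_chunks if c in retrieved]
    return len(hits) / len(required_chunks)


def missing_chunks(required_chunks, retrieved_texts):
    """Required chunks that were not retrieved, in their original order."""
    retrieved = set(retrieved_texts)
    return [c for c in required_chunks if c not in retrieved]


required = ["chunk A", "chunk B", "chunk C"]
print(question_recall(required, ["chunk A", "chunk B", "chunk C"]))
print(question_recall(required, ["chunk A", "chunk C", "chunk D"]))
print(missing_chunks(required, ["chunk A", "chunk C", "chunk D"]))
```

Because the golden set was captured from the same API moments earlier, every required chunk is still retrieved on the first run, which is why the initial check reports 100%.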
Step 4: Save Baseline¶
Stores the perfect state in SQLite for future comparison. You can have multiple baselines (v1.0, v2.0, etc.).
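To make the idea of labelled baselines concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, columns, and the choice to store per-question recall as JSON are assumptions for illustration and do not describe LongProbe's real schema.

```python
import json
import sqlite3


def save_baseline(conn, label, recalls):
    """Store per-question recall under a label, replacing any earlier copy."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baselines ("
        "label TEXT PRIMARY KEY, recalls TEXT NOT NULL, "
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO baselines (label, recalls) VALUES (?, ?)",
        (label, json.dumps(recalls)),
    )
    conn.commit()


def load_baseline(conn, label):
    """Return the stored recalls for a label, or None if it does not exist."""
    row = conn.execute(
        "SELECT recalls FROM baselines WHERE label = ?", (label,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
save_baseline(conn, "v1.0", {"q1": 1.0, "q2": 1.0})
save_baseline(conn, "v2.0", {"q1": 1.0, "q2": 0.67})
print(load_baseline(conn, "v1.0"))
```

Keying the table on the label is what lets several baselines coexist while saving under an existing label overwrites it.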
Step 5: Break API¶
Simulates a regression by removing a document. In real scenarios, this could be:
- Refactoring chunking strategy
- Upgrading embedding model
- Changing retrieval parameters
- Accidentally deleting documents
Step 6: Detect Regression¶
Re-runs tests and compares against baseline. LongProbe automatically detects:
- Which questions failed
- What chunks went missing
- Exact recall delta
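The comparison itself can be pictured as a per-question set difference. The sketch below assumes each run is a mapping from question to the set of required chunks that were retrieved, and that the baseline run had full recall. Only the `"regressions"` key mirrors the code example on this page; `diff_runs`, `"missing_chunks"`, and `"recall_delta"` are hypothetical names.

```python
def diff_runs(baseline, current):
    """Report questions whose set of retrieved required chunks shrank."""
    regressions = []
    for question, before in baseline.items():
        after = current.get(question, set())
        missing = sorted(before - after)
        if missing:
            regressions.append({"question": question, "missing_chunks": missing})
    # Assumption: the baseline had 100% recall, so its chunk count is the total.
    total = sum(len(chunks) for chunks in baseline.values())
    found = sum(len(baseline[q] & current.get(q, set())) for q in baseline)
    recall_delta = (found - total) / total if total else 0.0
    return {"regressions": regressions, "recall_delta": recall_delta}


baseline = {"q1": {"a", "b", "c"}, "q2": {"d", "e", "f"}}
current = {"q1": {"a", "b", "c"}, "q2": {"d", "f"}}
result = diff_runs(baseline, current)
print(result["regressions"])
print(f"{result['recall_delta']:+.1%}")
```

Here one of six chunks goes missing, so the sketch reports `q2` as regressed, names chunk `e`, and gives a negative recall delta.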
When to Use¶
Use this workflow when you need:
- Complete understanding of LongProbe
- End-to-end regression testing
- Baseline tracking and comparison
- Confidence before deployment
CLI Equivalent¶
You can achieve similar results with CLI commands:
```shell
# Capture golden questions
longprobe capture --url http://localhost:8000/retrieve

# Run initial check
longprobe check --goldens goldens.yaml

# Save baseline
longprobe baseline save --label v1.0

# After changes, compare
longprobe diff --baseline v1.0
```
Next Steps¶
- Monitor RAG Quality - For detailed monitoring
- Detect Regressions - For CI/CD integration
- Python API Guide - Learn more about the API
- Baseline Management - Advanced baseline workflows