Grounding Verification¶
Grounding verification ensures that answers are based on retrieved evidence, not the LLM's parametric knowledge.
The Problem¶
LLMs have extensive world knowledge from training. When evaluating RAG:
- An LLM might answer correctly from memory, not from context
- This inflates evaluation metrics
- You don't actually know if retrieval is working
Grounding Modes¶
aletheia evaluate-ragas \
--knowledge-graph my_graph \
--questions questions.json \
--grounding-mode <mode>
| Mode | Behavior | Use When |
|---|---|---|
| `strict` | Reject ungrounded answers | Default, recommended |
| `lenient` | Warn but include all | Exploring retrieval |
| `off` | No verification | Backward compatibility |
How It Works¶
1. Evidence Presentation¶
Context is presented as numbered evidence units:
EVIDENCE:
[E1] Hamas is designated as an FTO by the US State Department
[E2] Hamas has the alias Islamic Resistance Movement
[E3] Hezbollah is designated by UK and US authorities
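As a sketch, producing this layout from retrieved chunks is straightforward; `format_evidence` below is a hypothetical helper for illustration, not part of Aletheia's API:

```python
def format_evidence(chunks: list[str]) -> str:
    """Render retrieved context chunks as numbered evidence units."""
    lines = ["EVIDENCE:"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[E{i}] {chunk}")
    return "\n".join(lines)

prompt_context = format_evidence([
    "Hamas is designated as an FTO by the US State Department",
    "Hamas has the alias Islamic Resistance Movement",
])
print(prompt_context)
```

The stable `E1`, `E2`, ... labels are what make citations verifiable later: the verifier can map every cited ID back to an exact span of context.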
2. Citation Requirement¶
The LLM must cite evidence:
{
"answer": "Hamas is designated as an FTO",
"evidence_ids": ["E1"],
"reasoning": "E1 directly states Hamas is designated as FTO by US State Department"
}
3. Verification¶
Aletheia verifies:

- Are cited evidence IDs valid?
- Are entities in the answer mentioned in the cited evidence?
- Is INSUFFICIENT_CONTEXT used appropriately?
Rejection Reasons¶
| Reason | Description | Example |
|---|---|---|
| `parse_error` | JSON response malformed | Invalid JSON syntax |
| `answer_without_citations` | Answer but no evidence cited | Missing `evidence_ids` |
| `insufficient_context_with_citations` | Contradictory response | INSUFFICIENT_CONTEXT plus citations |
| `invalid_citation_ids` | Non-existent evidence IDs | Citing [E99] when only E1-E3 exist |
| `uncited_entities` | Entities not in cited evidence | Mentions "Iran" but no evidence about Iran |
Grounding Report¶
Evaluation output includes grounding metrics:
{
"grounding_report": {
"summary": {
"total_verified": 70,
"verification_passed": 58,
"verification_failed": 12,
"pass_rate": 0.83
},
"response_categories": {
"grounded_answers": 45,
"insufficient_context": 13,
"errors": 0
},
"rejection_breakdown": [
{"reason": "uncited_entities", "count": 7, "rate": 0.10},
{"reason": "answer_without_citations", "count": 3, "rate": 0.04},
{"reason": "parse_error", "count": 2, "rate": 0.03}
]
}
}
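As a sketch of consuming this report downstream (field names are taken from the example above, abridged to the fields used), a few lines can sanity-check the counts and surface the dominant rejection reason:

```python
# Grounding report as shown above, abridged to the fields used here
report = {
    "summary": {"total_verified": 70, "verification_passed": 58,
                "verification_failed": 12, "pass_rate": 0.83},
    "rejection_breakdown": [
        {"reason": "uncited_entities", "count": 7, "rate": 0.10},
        {"reason": "answer_without_citations", "count": 3, "rate": 0.04},
        {"reason": "parse_error", "count": 2, "rate": 0.03},
    ],
}

summary = report["summary"]
# Passed and failed counts should account for every verified answer
assert summary["verification_passed"] + summary["verification_failed"] == summary["total_verified"]

# The most frequent rejection reason is usually the place to start debugging
top = max(report["rejection_breakdown"], key=lambda r: r["count"])
print(f"pass rate {summary['pass_rate']:.0%}; top rejection: {top['reason']} ({top['count']})")
# prints: pass rate 83%; top rejection: uncited_entities (7)
```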
Interpreting Results¶
High Pass Rate (> 80%)¶
- Retrieval is working well
- LLM is grounding answers in context
- Metrics are trustworthy
Low Pass Rate (< 60%)¶
Possible causes:
- Retrieval returning wrong context: fix the search configuration
- Questions require external knowledge: revise the question set
- LLM ignoring instructions: check prompt quality
High INSUFFICIENT_CONTEXT Rate¶
Not necessarily bad! It means:
- The LLM correctly identifies when it can't answer
- Better than hallucinating
- Check if retrieval should have found the answer
Strict vs Lenient¶
Use Strict When¶
- Final evaluation metrics matter
- You need accurate faithfulness scores
- Questions should be answerable from the graph
Use Lenient When¶
- Exploring retrieval quality
- Questions may be unanswerable
- Want to see what LLM would answer regardless
Best Practices¶
1. Start with Strict Mode¶
This gives you accurate baseline metrics.
2. Analyze Rejections¶
Check what's being rejected:
import json

with open("ragas_output.json") as f:
    results = json.load(f)

for r in results["detailed_results"]:
    if r.get("grounding_rejected"):
        print(f"Q: {r['question']}")
        print(f"Reason: {r['rejection_reason']}")
3. Curate Questions¶
Remove questions that consistently fail grounding:
- Questions requiring external knowledge
- Ambiguous questions
- SQL-like aggregation questions
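A sketch of automating this curation, assuming `questions.json` holds a list of objects with a `"question"` field (the `detailed_results` fields match the rejection-analysis snippet above):

```python
def filter_rejected(questions: list[dict], results: dict) -> list[dict]:
    """Drop questions that were rejected by grounding verification."""
    rejected = {r["question"] for r in results["detailed_results"]
                if r.get("grounding_rejected")}
    return [q for q in questions if q["question"] not in rejected]

questions = [
    {"question": "Who designates Hamas as an FTO?"},
    {"question": "What is Iran's GDP?"},  # needs external knowledge
]
results = {"detailed_results": [
    {"question": "What is Iran's GDP?",
     "grounding_rejected": True,
     "rejection_reason": "uncited_entities"},
]}

kept = filter_rejected(questions, results)
# kept now contains only the grounded question
```

Run this over several evaluation rounds and drop only questions that fail consistently, so a one-off retrieval miss does not remove a valid question.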
4. Compare Modes¶
Run both modes to understand the gap:
# Strict
aletheia evaluate-ragas --grounding-mode strict ...
# Lenient
aletheia evaluate-ragas --grounding-mode lenient ...
A large difference between the two runs suggests parametric knowledge contamination.
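One way to quantify the gap, using the `grounding_report` fields shown earlier (the numbers and the 0.15 threshold below are illustrative, not a recommended cutoff):

```python
def pass_rate(report: dict) -> float:
    """Extract the grounding pass rate from an evaluation output dict."""
    s = report["grounding_report"]["summary"]
    return s["verification_passed"] / s["total_verified"]

# Hypothetical summaries from a strict run and a lenient run
strict = {"grounding_report": {"summary": {"verification_passed": 58, "total_verified": 70}}}
lenient = {"grounding_report": {"summary": {"verification_passed": 69, "total_verified": 70}}}

gap = pass_rate(lenient) - pass_rate(strict)
if gap > 0.15:  # threshold chosen for illustration only
    print("Large strict/lenient gap: likely parametric knowledge contamination")
```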
Learn More¶
- RAGAS Metrics - Understanding metrics
- Avoiding Parametric Knowledge - Data design
- Question Format - Question design