Grounding Verification

Grounding verification ensures that answers are based on retrieved evidence, not the LLM's parametric knowledge.

The Problem

LLMs have extensive world knowledge from training. When evaluating RAG:

  • An LLM might answer correctly from memory, not from context
  • This inflates evaluation metrics
  • You don't actually know if retrieval is working

Grounding Modes

aletheia evaluate-ragas \
  --knowledge-graph my_graph \
  --questions questions.json \
  --grounding-mode <mode>

| Mode | Behavior | Use When |
| --- | --- | --- |
| strict | Reject ungrounded answers | Default, recommended |
| lenient | Warn but include all | Exploring retrieval |
| off | No verification | Backward compatibility |

How It Works

1. Evidence Presentation

Context is presented as numbered evidence units:

EVIDENCE:
[E1] Hamas is designated as an FTO by the US State Department
[E2] Hamas has the alias Islamic Resistance Movement
[E3] Hezbollah is designated by UK and US authorities
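
The formatting above can be sketched as a small helper (the function name is hypothetical, not part of Aletheia's API):

```python
# A minimal sketch of rendering retrieved chunks as numbered evidence
# units ([E1], [E2], ...) for the grounding prompt.
def format_evidence(chunks):
    lines = ["EVIDENCE:"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[E{i}] {chunk}")
    return "\n".join(lines)
```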

2. Citation Requirement

The LLM must cite evidence:

{
  "answer": "Hamas is designated as an FTO",
  "evidence_ids": ["E1"],
  "reasoning": "E1 directly states Hamas is designated as FTO by US State Department"
}

3. Verification

Aletheia verifies:

  • Are cited evidence IDs valid?
  • Are entities in the answer mentioned in the cited evidence?
  • Is INSUFFICIENT_CONTEXT used appropriately?

Rejection Reasons

| Reason | Description | Example |
| --- | --- | --- |
| parse_error | JSON response malformed | Invalid JSON syntax |
| answer_without_citations | Answer but no evidence cited | Missing evidence_ids |
| insufficient_context_with_citations | Contradictory response | INSUFFICIENT_CONTEXT + citations |
| invalid_citation_ids | Non-existent evidence IDs | Citing [E99] when only E1-E3 exist |
| uncited_entities | Entities not in cited evidence | Mentions "Iran" but no evidence about Iran |
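
The verification checks and rejection reasons above could be implemented along these lines. This is a hedged sketch, not Aletheia's actual code: the function name is invented, and entity extraction is assumed to have happened upstream (`known_entities` maps each entity named in the answer to the evidence IDs that mention it).

```python
import json

INSUFFICIENT = "INSUFFICIENT_CONTEXT"

def verify_grounding(raw_response, valid_ids, known_entities):
    """Return a rejection reason string, or None if the response passes."""
    try:
        resp = json.loads(raw_response)
    except json.JSONDecodeError:
        return "parse_error"

    cited = resp.get("evidence_ids") or []
    answer = resp.get("answer", "")

    # Declining to answer while also citing evidence is contradictory.
    if answer == INSUFFICIENT and cited:
        return "insufficient_context_with_citations"
    # An answer with no citations cannot be verified.
    if answer != INSUFFICIENT and not cited:
        return "answer_without_citations"
    # Every cited ID must exist in the presented evidence.
    if any(eid not in valid_ids for eid in cited):
        return "invalid_citation_ids"
    # Every entity named in the answer must appear in cited evidence.
    for entity, mentioned_in in known_entities.items():
        if not set(mentioned_in) & set(cited):
            return "uncited_entities"
    return None
```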

Grounding Report

Evaluation output includes grounding metrics:

{
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "verification_failed": 12,
      "pass_rate": 0.83
    },
    "response_categories": {
      "grounded_answers": 45,
      "insufficient_context": 13,
      "errors": 0
    },
    "rejection_breakdown": [
      {"reason": "uncited_entities", "count": 7, "rate": 0.10},
      {"reason": "answer_without_citations", "count": 3, "rate": 0.04},
      {"reason": "parse_error", "count": 2, "rate": 0.03}
    ]
  }
}
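
A report like the one above can be summarised with a few lines of Python (the function name is hypothetical; the dictionary keys follow the structure shown):

```python
# Print the headline numbers from a grounding_report dict.
def summarise_grounding(report):
    s = report["summary"]
    lines = [f"pass rate: {s['verification_passed']}/{s['total_verified']}"
             f" ({s['pass_rate']:.0%})"]
    for item in report["rejection_breakdown"]:
        lines.append(f"  {item['reason']}: {item['count']} ({item['rate']:.0%})")
    return "\n".join(lines)
```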

Interpreting Results

High Pass Rate (> 80%)

  • Retrieval is working well
  • LLM is grounding answers in context
  • Metrics are trustworthy

Low Pass Rate (< 60%)

Possible causes:

  1. Retrieval is returning the wrong context: fix the search configuration
  2. Questions require external knowledge: revise the question set
  3. The LLM is ignoring instructions: check the prompt quality

High INSUFFICIENT_CONTEXT Rate

Not necessarily bad! It means:

  • The LLM correctly identifies when it can't answer from the context
  • Declining is better than hallucinating

Even so, check whether retrieval should have found the answer.

Strict vs Lenient

Use Strict When

  • Final evaluation metrics matter
  • You need accurate faithfulness scores
  • Questions should be answerable from the graph

Use Lenient When

  • Exploring retrieval quality
  • Questions may be unanswerable from the graph
  • You want to see what the LLM would answer regardless

Best Practices

1. Start with Strict Mode

aletheia evaluate-ragas \
  --grounding-mode strict \
  --questions questions.json

This gives you accurate baseline metrics.

2. Analyze Rejections

Check what's being rejected:

import json

# Load the evaluation output and list grounding-rejected questions.
with open("ragas_output.json") as f:
    results = json.load(f)

for r in results["detailed_results"]:
    if r.get("grounding_rejected"):
        print(f"Q: {r['question']}")
        print(f"Reason: {r['rejection_reason']}")

3. Curate Questions

Remove questions that consistently fail grounding:

  • Questions requiring external knowledge
  • Ambiguous questions
  • SQL-like aggregation questions
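
One way to prune the set is to drop questions that were rejected in a previous run. This sketch assumes the output fields shown earlier (`detailed_results`, `grounding_rejected`, `question`) and a questions.json structured as a list of objects with a `question` field; the function name is hypothetical:

```python
# Keep only questions that were not grounding-rejected in a prior run.
def curate_questions(results, questions):
    failed = {r["question"] for r in results["detailed_results"]
              if r.get("grounding_rejected")}
    return [q for q in questions if q["question"] not in failed]
```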

4. Compare Modes

Run both modes to understand the gap:

# Strict
aletheia evaluate-ragas --grounding-mode strict ...

# Lenient
aletheia evaluate-ragas --grounding-mode lenient ...

A large difference between the two runs suggests parametric-knowledge contamination.
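
Assuming both runs emit the grounding report shown earlier (whether lenient mode reports a pass rate in exactly this shape is an assumption), the gap can be computed directly:

```python
# Lenient-minus-strict grounding pass rate; a large positive gap hints
# that answers are coming from parametric knowledge, not retrieval.
def grounding_gap(strict_report, lenient_report):
    strict = strict_report["summary"]["pass_rate"]
    lenient = lenient_report["summary"]["pass_rate"]
    return lenient - strict
```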

Learn More