Grounding Verification

Grounding verification ensures that answers are based on retrieved evidence, not the LLM's parametric knowledge.

The Problem

LLMs have extensive world knowledge from training. When evaluating RAG:

  • An LLM might answer correctly from memory, not from context
  • This inflates evaluation metrics
  • You don't actually know if retrieval is working

Grounding Modes

aletheia evaluate-ragas \
  --knowledge-graph my_graph \
  --questions questions.json \
  --grounding-mode <mode>

| Mode | Behavior | Use When |
| --- | --- | --- |
| strict | Reject ungrounded answers | Default, recommended |
| lenient | Warn but include all | Exploring retrieval |
| off | No verification | Backward compatibility |

How It Works

1. Evidence Presentation

Context is presented as numbered evidence units:

EVIDENCE:
[E1] Hamas is designated as an FTO by the US State Department
[E2] Hamas has the alias Islamic Resistance Movement
[E3] Hezbollah is designated by UK and US authorities
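
The formatting above can be sketched as a small helper (the function name is hypothetical, not part of Aletheia's API):

```python
# A minimal sketch of rendering retrieved chunks as numbered evidence
# units ([E1], [E2], ...) for the grounding prompt.
def format_evidence(chunks):
    lines = ["EVIDENCE:"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[E{i}] {chunk}")
    return "\n".join(lines)
```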

2. Citation Requirement

The LLM must cite evidence:

{
  "answer": "Hamas is designated as an FTO",
  "evidence_ids": ["E1"],
  "reasoning": "E1 directly states Hamas is designated as FTO by US State Department"
}

3. Verification

Aletheia verifies:

  • Are cited evidence IDs valid?
  • Are entities in the answer mentioned in the cited evidence?
  • Is INSUFFICIENT_CONTEXT used appropriately?

Rejection Reasons

| Reason | Description | Example |
| --- | --- | --- |
| parse_error | JSON response malformed | Invalid JSON syntax |
| answer_without_citations | Answer but no evidence cited | Missing evidence_ids |
| insufficient_context_with_citations | Contradictory response | INSUFFICIENT_CONTEXT + citations |
| invalid_citation_ids | Non-existent evidence IDs | Citing [E99] when only E1-E3 exist |
| uncited_entities | Entities not in cited evidence | Mentions "Iran" but no evidence about Iran |
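
The verification checks and rejection reasons above could be implemented along these lines. This is a hedged sketch, not Aletheia's actual code: the function name is invented, and entity extraction is assumed to have happened upstream (`known_entities` maps each entity named in the answer to the evidence IDs that mention it).

```python
import json

INSUFFICIENT = "INSUFFICIENT_CONTEXT"

def verify_grounding(raw_response, valid_ids, known_entities):
    """Return a rejection reason string, or None if the response passes."""
    try:
        resp = json.loads(raw_response)
    except json.JSONDecodeError:
        return "parse_error"

    cited = resp.get("evidence_ids") or []
    answer = resp.get("answer", "")

    # Declining to answer while also citing evidence is contradictory.
    if answer == INSUFFICIENT and cited:
        return "insufficient_context_with_citations"
    # An answer with no citations cannot be verified.
    if answer != INSUFFICIENT and not cited:
        return "answer_without_citations"
    # Every cited ID must exist in the presented evidence.
    if any(eid not in valid_ids for eid in cited):
        return "invalid_citation_ids"
    # Every entity named in the answer must appear in cited evidence.
    for entity, mentioned_in in known_entities.items():
        if not set(mentioned_in) & set(cited):
            return "uncited_entities"
    return None
```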

Grounding Report

Evaluation output includes grounding metrics:

{
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "verification_failed": 12,
      "pass_rate": 0.83
    },
    "response_categories": {
      "grounded_answers": 45,
      "insufficient_context": 13,
      "errors": 0
    },
    "rejection_breakdown": [
      {"reason": "uncited_entities", "count": 7, "rate": 0.10},
      {"reason": "answer_without_citations", "count": 3, "rate": 0.04},
      {"reason": "parse_error", "count": 2, "rate": 0.03}
    ]
  }
}
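
A report like the one above can be summarised with a few lines of Python (the function name is hypothetical; the dictionary keys follow the structure shown):

```python
# Print the headline numbers from a grounding_report dict.
def summarise_grounding(report):
    s = report["summary"]
    lines = [f"pass rate: {s['verification_passed']}/{s['total_verified']}"
             f" ({s['pass_rate']:.0%})"]
    for item in report["rejection_breakdown"]:
        lines.append(f"  {item['reason']}: {item['count']} ({item['rate']:.0%})")
    return "\n".join(lines)
```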

Interpreting Results

High Pass Rate (> 80%)

  • Retrieval is working well
  • LLM is grounding answers in context
  • Metrics are trustworthy

Low Pass Rate (< 60%)

Possible causes:

  1. Retrieval is returning the wrong context: fix the search configuration
  2. Questions require external knowledge: revise the question set
  3. The LLM is ignoring instructions: check the prompt quality

High INSUFFICIENT_CONTEXT Rate

Not necessarily bad! It means:

  • The LLM correctly identifies when it can't answer from the context
  • Declining is better than hallucinating

Even so, check whether retrieval should have found the answer.

Strict vs Lenient

Use Strict When

  • Final evaluation metrics matter
  • You need accurate faithfulness scores
  • Questions should be answerable from the graph

Use Lenient When

  • Exploring retrieval quality
  • Questions may be unanswerable from the graph
  • You want to see what the LLM would answer regardless

Best Practices

1. Start with Strict Mode

aletheia evaluate-ragas \
  --grounding-mode strict \
  --questions questions.json

This gives you accurate baseline metrics.

2. Analyze Rejections

Check what's being rejected:

import json

# Load the evaluation output and list grounding-rejected questions.
with open("ragas_output.json") as f:
    results = json.load(f)

for r in results["detailed_results"]:
    if r.get("grounding_rejected"):
        print(f"Q: {r['question']}")
        print(f"Reason: {r['rejection_reason']}")

3. Curate Questions

Remove questions that consistently fail grounding:

  • Questions requiring external knowledge
  • Ambiguous questions
  • SQL-like aggregation questions
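
One way to prune the set is to drop questions that were rejected in a previous run. This sketch assumes the output fields shown earlier (`detailed_results`, `grounding_rejected`, `question`) and a questions.json structured as a list of objects with a `question` field; the function name is hypothetical:

```python
# Keep only questions that were not grounding-rejected in a prior run.
def curate_questions(results, questions):
    failed = {r["question"] for r in results["detailed_results"]
              if r.get("grounding_rejected")}
    return [q for q in questions if q["question"] not in failed]
```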

4. Compare Modes

Run both modes to understand the gap:

# Strict
aletheia evaluate-ragas --grounding-mode strict ...

# Lenient
aletheia evaluate-ragas --grounding-mode lenient ...

A large difference between the two runs suggests parametric-knowledge contamination.
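
Assuming both runs emit the grounding report shown earlier (whether lenient mode reports a pass rate in exactly this shape is an assumption), the gap can be computed directly:

```python
# Lenient-minus-strict grounding pass rate; a large positive gap hints
# that answers are coming from parametric knowledge, not retrieval.
def grounding_gap(strict_report, lenient_report):
    strict = strict_report["summary"]["pass_rate"]
    lenient = lenient_report["summary"]["pass_rate"]
    return lenient - strict
```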

Learn More