# Running Evaluations
After building a knowledge graph, you need to know whether search and retrieval actually work: whether the right entities surface for the right questions, and whether answers are grounded in retrieved evidence rather than in the LLM's parametric knowledge. This guide covers how to evaluate retrieval quality with RAGAS.
## What RAGAS Measures
RAGAS (Retrieval-Augmented Generation Assessment) evaluates:
| Metric | Measures | Good Score |
|---|---|---|
| Context Precision | Are retrieved results relevant? | > 0.7 |
| Context Recall | Is all needed info retrieved? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.7 |
| Answer Similarity | Does answer match expected? | > 0.7 |
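The thresholds above can be checked programmatically once you have an evaluation's aggregate scores. This is an illustrative sketch, not part of the aletheia CLI; the metric names mirror the table and the example output later in this guide.

```python
# Sketch: flag RAGAS metrics that fall below the suggested 0.7 threshold.
THRESHOLD = 0.7

def below_threshold(scores: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Return the names of metrics scoring under the threshold."""
    return [name for name, value in scores.items() if value < threshold]

failing = below_threshold({
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75,
})
print(failing)  # → ['context_recall']
```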
## Running an Evaluation

### Basic Evaluation

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions use_cases/terrorist_orgs/evaluation_questions.json \
    --output-dir output/
```

### With Grounding Verification

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --grounding-mode strict
```
Grounding modes:

| Mode | Behavior |
|---|---|
| `strict` | Reject answers not grounded in evidence |
| `lenient` | Warn, but include all answers |
| `off` | Legacy mode; no verification |
### With Community Search

Include community context:

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --use-community-search
```
### Baseline Comparison

Compare BFS + cosine retrieval against cosine-only retrieval:

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --compare-baseline
```
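If you prefer to compare two separate runs yourself, the per-metric deltas can be computed from their JSON outputs. This is a hypothetical helper, not an aletheia command; it assumes both files follow the `aggregate_scores` structure shown under Output Files below.

```python
import json

def score_delta(run_a: str, run_b: str) -> dict[str, float]:
    """Per-metric difference between two RAGAS JSON outputs (b minus a)."""
    with open(run_a) as fa, open(run_b) as fb:
        a = json.load(fa)["aggregate_scores"]
        b = json.load(fb)["aggregate_scores"]
    # Only compare metrics present in both runs.
    return {m: round(b[m] - a[m], 3) for m in a.keys() & b.keys()}
```

A positive delta for `context_recall`, for example, suggests the second configuration retrieved more of the needed evidence.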
## Question Format

Questions must be in JSON format:

```json
{
  "questions": [
    {
      "id": "q1",
      "question": "What alias is used for al-Shabaab?",
      "answer": "al-Hijra",
      "answer_aliases": ["Al-Hijra", "al Hijra"]
    },
    {
      "id": "q2",
      "question": "Is Hamas designated as an FTO?",
      "answer": "Yes, Hamas is designated as a Foreign Terrorist Organization by the US State Department."
    }
  ]
}
```
Fields:

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique question identifier |
| `question` | Yes | The question to ask |
| `answer` | Yes | Expected (gold) answer |
| `answer_aliases` | No | Alternative correct answers |
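A malformed question file is easier to catch before a run than after one. The sketch below is an illustrative validator, not part of the aletheia CLI; it enforces the required fields and unique `id`s described above.

```python
import json

REQUIRED = {"id", "question", "answer"}

def validate_questions(path: str) -> list[str]:
    """Return human-readable problems found in a question file."""
    with open(path) as f:
        data = json.load(f)
    questions = data.get("questions")
    if not isinstance(questions, list):
        return ['top-level "questions" must be a list']
    problems, seen_ids = [], set()
    for i, q in enumerate(questions):
        missing = REQUIRED - q.keys()
        if missing:
            problems.append(f"question {i}: missing {sorted(missing)}")
        qid = q.get("id")
        if qid in seen_ids:
            problems.append(f"question {i}: duplicate id {qid!r}")
        seen_ids.add(qid)
    return problems
```

An empty return value means the file passed these basic checks.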
## Output Files

Each evaluation produces two files:

### JSON Output

`ragas_YYYYMMDD_HHMMSS.json`:

```json
{
  "aggregate_scores": {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75
  },
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "pass_rate": 0.83
    }
  },
  "detailed_results": [...]
}
```
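A quick summary can be printed from this JSON with a few lines of Python. This sketch assumes the key names shown in the example structure above.

```python
import json

def summarize(path: str) -> str:
    """One-line-per-metric summary of a RAGAS JSON output file."""
    with open(path) as f:
        report = json.load(f)
    lines = [f"{metric}: {score:.2f}"
             for metric, score in report["aggregate_scores"].items()]
    grounding = report.get("grounding_report", {}).get("summary", {})
    if grounding:
        lines.append(f"grounding pass rate: {grounding['pass_rate']:.0%}")
    return "\n".join(lines)
```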
### Markdown Report

`ragas_YYYYMMDD_HHMMSS.md`:

```markdown
# RAGAS Evaluation Report

## Summary

| Metric | Score |
|--------|-------|
| Context Precision | 0.72 |
| Context Recall | 0.68 |
| Faithfulness | 0.81 |
| Answer Similarity | 0.75 |

## Per-Question Results

...
```
## Interpreting Results

### High Scores

- High Precision + High Recall: retrieval is working well
- High Faithfulness: answers are grounded in the retrieved context
- High Similarity: answers match the expected answers
### Problem Patterns
| Pattern | Likely Cause | Solution |
|---|---|---|
| Low Precision, High Recall | Too many irrelevant results | Reduce search limit |
| High Precision, Low Recall | Missing relevant results | Increase BFS depth |
| Low Faithfulness | Answer from parametric knowledge | Check grounding mode |
| High Similarity, Low Faithfulness | LLM "knows" the answer | Use obscure test data |
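The patterns in this table can be turned into a small triage helper. The function below is a sketch with illustrative thresholds, not an aletheia feature; it maps score patterns to the likely causes listed above.

```python
def diagnose(precision: float, recall: float, faithfulness: float,
             threshold: float = 0.7) -> list[str]:
    """Map a score pattern to likely causes and suggested fixes."""
    hints = []
    if precision < threshold <= recall:
        hints.append("too many irrelevant results: try reducing the search limit")
    if recall < threshold <= precision:
        hints.append("missing relevant results: try increasing BFS depth")
    if faithfulness < threshold:
        hints.append("answer may come from parametric knowledge: check grounding mode")
    return hints

print(diagnose(precision=0.5, recall=0.9, faithfulness=0.9))
```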
### INSUFFICIENT_CONTEXT

If many questions return `INSUFFICIENT_CONTEXT`:

- Check that the data was ingested correctly
- Verify that search is returning results
- Consider whether the questions are too specific
## Best Practices

### 1. Use Domain-Specific Questions

Questions should require data that exists only in your graph:

```json
// Good - requires specific graph data
{"question": "What alias does the PKK use?"}

// Bad - LLM may know this from training
{"question": "What is the capital of France?"}
```
### 2. Curate Question Types
Different question types test different capabilities:
| Type | Example | Tests |
|---|---|---|
| Alias lookup | "What's another name for X?" | Node properties |
| Relationship | "Who sanctioned X?" | Edge traversal |
| Multi-hop | "What's the parent org of X's affiliate?" | Graph traversal |
### 3. Use Grounding Verification

Always use `--grounding-mode strict` for accurate metrics. This prevents parametric knowledge from inflating scores.
### 4. Compare Baselines

Use `--compare-baseline` to understand the impact of BFS traversal versus pure semantic search.
## Learn More
- RAGAS Metrics - Deep dive into metrics
- Grounding Verification - How grounding works
- Question Format - Question design