Running Evaluations

This guide covers how to evaluate your knowledge graph's retrieval quality using RAGAS.

What RAGAS Measures

RAGAS (Retrieval-Augmented Generation Assessment) evaluates:

| Metric | Measures | Good Score |
|--------|----------|------------|
| Context Precision | Are retrieved results relevant? | > 0.7 |
| Context Recall | Is all needed info retrieved? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.7 |
| Answer Similarity | Does the answer match the expected answer? | > 0.7 |
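The 0.7 thresholds above can also be checked programmatically once you have an aggregate-scores dictionary. A minimal sketch (the function name and the choice to treat exactly 0.7 as "weak" are illustrative, not part of the tool):

```python
# Flag metrics that fall at or below the "good score" threshold from the
# table above. Names here are illustrative, not part of the aletheia CLI.
GOOD_THRESHOLD = 0.7

def flag_weak_metrics(scores: dict, threshold: float = GOOD_THRESHOLD) -> list:
    """Return the metric names scoring at or below the threshold."""
    return [name for name, score in scores.items() if score <= threshold]

scores = {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75,
}
print(flag_weak_metrics(scores))  # ['context_recall']
```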

Running an Evaluation

Basic Evaluation

aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions use_cases/terrorist_orgs/evaluation_questions.json \
  --output-dir output/

With Grounding Verification

aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --grounding-mode strict

Grounding modes:

| Mode | Behavior |
|------|----------|
| strict | Reject answers not grounded in evidence |
| lenient | Warn but include all answers |
| off | Legacy mode, no verification |
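Conceptually, the three modes differ only in what happens to an answer that fails verification. A minimal sketch of that logic (the data shape, a list of `(text, is_grounded)` pairs, is hypothetical; the real CLI operates on its own internal structures):

```python
# Hypothetical illustration of the three grounding modes. The answer
# representation is an assumption made for this sketch.
import warnings

def apply_grounding(answers, mode):
    """answers: list of (text, is_grounded) pairs; mode: strict|lenient|off."""
    if mode == "off":        # legacy: no verification at all
        return [text for text, _ in answers]
    if mode == "strict":     # drop anything not backed by evidence
        return [text for text, grounded in answers if grounded]
    if mode == "lenient":    # keep everything, but warn about ungrounded answers
        for text, grounded in answers:
            if not grounded:
                warnings.warn(f"ungrounded answer kept: {text!r}")
        return [text for text, _ in answers]
    raise ValueError(f"unknown grounding mode: {mode}")

answers = [("al-Hijra", True), ("Paris", False)]
print(apply_grounding(answers, "strict"))  # ['al-Hijra']
```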

Include community context:

aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --use-community-search

Baseline Comparison

Compare BFS+cosine retrieval against cosine-only retrieval:

aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --compare-baseline

Question Format

Questions must be in JSON format:

{
  "questions": [
    {
      "id": "q1",
      "question": "What alias is used for al-Shabaab?",
      "answer": "al-Hijra",
      "answer_aliases": ["Al-Hijra", "al Hijra"]
    },
    {
      "id": "q2",
      "question": "Is Hamas designated as an FTO?",
      "answer": "Yes, Hamas is designated as a Foreign Terrorist Organization by the US State Department."
    }
  ]
}
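A question file in this format can be validated before running an evaluation. A minimal loader sketch (the helper name and error messages are our own; aletheia performs its own validation internally):

```python
# Load a question file and check the required fields from the table below.
# This validator is illustrative, not part of the aletheia CLI.
import json

REQUIRED_FIELDS = {"id", "question", "answer"}

def load_questions(path):
    with open(path) as f:
        data = json.load(f)
    for q in data["questions"]:
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            raise ValueError(
                f"question {q.get('id', '?')!r} is missing fields: {sorted(missing)}"
            )
        # answer_aliases is optional, but must be a list when present
        if not isinstance(q.get("answer_aliases", []), list):
            raise ValueError(f"answer_aliases must be a list in {q['id']!r}")
    return data["questions"]
```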

Fields:

| Field | Required | Description |
|-------|----------|-------------|
| id | Yes | Unique question identifier |
| question | Yes | The question to ask |
| answer | Yes | Expected/gold answer |
| answer_aliases | No | Alternative correct answers |

Output Files

Evaluations produce two files:

JSON Output

ragas_YYYYMMDD_HHMMSS.json:

{
  "aggregate_scores": {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75
  },
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "pass_rate": 0.83
    }
  },
  "detailed_results": [...]
}
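The summary fields are related: pass_rate is verification_passed divided by total_verified (58/70 ≈ 0.83 above). A small sketch that recomputes it from a parsed results file (the helper name is illustrative):

```python
# Recompute the grounding pass rate from the JSON output shown above.
# The helper name is our own; only the field names come from the file format.
def grounding_pass_rate(report: dict) -> float:
    s = report["grounding_report"]["summary"]
    return s["verification_passed"] / s["total_verified"]

report = {
    "grounding_report": {
        "summary": {"total_verified": 70, "verification_passed": 58, "pass_rate": 0.83}
    }
}
print(round(grounding_pass_rate(report), 2))  # 0.83
```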

Markdown Report

ragas_YYYYMMDD_HHMMSS.md:

# RAGAS Evaluation Report

## Summary
| Metric | Score |
|--------|-------|
| Context Precision | 0.72 |
| Context Recall | 0.68 |
| Faithfulness | 0.81 |
| Answer Similarity | 0.75 |

## Per-Question Results
...

Interpreting Results

High Scores

  • High Precision + High Recall: Retrieval is working well
  • High Faithfulness: Answers are grounded in context
  • High Similarity: Answers match expected

Problem Patterns

| Pattern | Likely Cause | Solution |
|---------|--------------|----------|
| Low Precision, High Recall | Too many irrelevant results | Reduce search limit |
| High Precision, Low Recall | Missing relevant results | Increase BFS depth |
| Low Faithfulness | Answer from parametric knowledge | Check grounding mode |
| High Similarity, Low Faithfulness | LLM "knows" the answer | Use obscure test data |
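The pattern table can be turned into a quick diagnostic over an aggregate-scores dictionary. A sketch (the 0.7 cutoff and hint wording are illustrative, not a feature of the tool):

```python
# Map the score patterns above to suggested fixes. Thresholds and wording
# are illustrative; this is not part of the aletheia CLI.
def diagnose(scores: dict, threshold: float = 0.7) -> list:
    p, r = scores["context_precision"], scores["context_recall"]
    f, s = scores["faithfulness"], scores["answer_similarity"]
    hints = []
    if p <= threshold < r:
        hints.append("Low precision, high recall: reduce the search limit")
    if r <= threshold < p:
        hints.append("High precision, low recall: increase BFS depth")
    if f <= threshold:
        hints.append("Low faithfulness: check the grounding mode")
    if s > threshold >= f:
        hints.append("High similarity, low faithfulness: use more obscure test data")
    return hints

print(diagnose({"context_precision": 0.72, "context_recall": 0.68,
                "faithfulness": 0.81, "answer_similarity": 0.75}))
```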

INSUFFICIENT_CONTEXT

If many questions return INSUFFICIENT_CONTEXT:

  1. Check that data was ingested correctly
  2. Verify search is returning results
  3. Consider if questions are too specific

Best Practices

1. Use Domain-Specific Questions

Questions should require information that exists only in your graph:

// Good - requires specific graph data
{"question": "What alias does the PKK use?"}

// Bad - LLM may know this from training
{"question": "What is the capital of France?"}

2. Curate Question Types

Different question types test different capabilities:

| Type | Example | Tests |
|------|---------|-------|
| Alias lookup | "What's another name for X?" | Node properties |
| Relationship | "Who sanctioned X?" | Edge traversal |
| Multi-hop | "What's the parent org of X's affiliate?" | Graph traversal |

3. Use Grounding Verification

Always use --grounding-mode strict for accurate metrics. This prevents parametric knowledge from inflating scores.

4. Compare Baselines

Use --compare-baseline to understand the impact of BFS traversal vs pure semantic search.

Learn More