# Running Evaluations
After building a knowledge graph, you need to know whether search and retrieval actually work: whether the right entities surface for the right questions, and whether answers are grounded in retrieved evidence rather than in the LLM's parametric knowledge. This guide covers how to evaluate retrieval quality with RAGAS.
## What RAGAS Measures
RAGAS (Retrieval-Augmented Generation Assessment) evaluates:
| Metric | Measures | Good Score |
|---|---|---|
| Context Precision | Are retrieved results relevant? | > 0.7 |
| Context Recall | Is all needed info retrieved? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.7 |
| Answer Similarity | Does answer match expected? | > 0.7 |
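The thresholds above can be checked programmatically once you have an evaluation's aggregate scores. This is an illustrative sketch, not part of the aletheia CLI; the metric names mirror the table and the example output later in this guide.

```python
# Sketch: flag RAGAS metrics that fall below the suggested 0.7 threshold.
THRESHOLD = 0.7

def below_threshold(scores: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Return the names of metrics scoring under the threshold."""
    return [name for name, value in scores.items() if value < threshold]

failing = below_threshold({
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75,
})
print(failing)  # → ['context_recall']
```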
## Running an Evaluation

### Basic Evaluation

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions use_cases/terrorist_orgs/evaluation_questions.json \
    --output-dir output/
```

### With Grounding Verification

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --grounding-mode strict
```
Grounding modes:

| Mode | Behavior |
|---|---|
| `strict` | Reject answers not grounded in evidence |
| `lenient` | Warn, but include all answers |
| `off` | Legacy mode; no verification |
### With Community Search

Include community context:

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --use-community-search
```
### Baseline Comparison

Compare BFS + cosine retrieval against cosine-only retrieval:

```bash
aletheia evaluate-ragas \
    --knowledge-graph terrorist_orgs \
    --questions questions.json \
    --compare-baseline
```
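If you prefer to compare two separate runs yourself, the per-metric deltas can be computed from their JSON outputs. This is a hypothetical helper, not an aletheia command; it assumes both files follow the `aggregate_scores` structure shown under Output Files below.

```python
import json

def score_delta(run_a: str, run_b: str) -> dict[str, float]:
    """Per-metric difference between two RAGAS JSON outputs (b minus a)."""
    with open(run_a) as fa, open(run_b) as fb:
        a = json.load(fa)["aggregate_scores"]
        b = json.load(fb)["aggregate_scores"]
    # Only compare metrics present in both runs.
    return {m: round(b[m] - a[m], 3) for m in a.keys() & b.keys()}
```

A positive delta for `context_recall`, for example, suggests the second configuration retrieved more of the needed evidence.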
## Question Format

Questions must be in JSON format:

```json
{
  "questions": [
    {
      "id": "q1",
      "question": "What alias is used for al-Shabaab?",
      "answer": "al-Hijra",
      "answer_aliases": ["Al-Hijra", "al Hijra"]
    },
    {
      "id": "q2",
      "question": "Is Hamas designated as an FTO?",
      "answer": "Yes, Hamas is designated as a Foreign Terrorist Organization by the US State Department."
    }
  ]
}
```
Fields:

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique question identifier |
| `question` | Yes | The question to ask |
| `answer` | Yes | Expected (gold) answer |
| `answer_aliases` | No | Alternative correct answers |
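A malformed question file is easier to catch before a run than after one. The sketch below is an illustrative validator, not part of the aletheia CLI; it enforces the required fields and unique `id`s described above.

```python
import json

REQUIRED = {"id", "question", "answer"}

def validate_questions(path: str) -> list[str]:
    """Return human-readable problems found in a question file."""
    with open(path) as f:
        data = json.load(f)
    questions = data.get("questions")
    if not isinstance(questions, list):
        return ['top-level "questions" must be a list']
    problems, seen_ids = [], set()
    for i, q in enumerate(questions):
        missing = REQUIRED - q.keys()
        if missing:
            problems.append(f"question {i}: missing {sorted(missing)}")
        qid = q.get("id")
        if qid in seen_ids:
            problems.append(f"question {i}: duplicate id {qid!r}")
        seen_ids.add(qid)
    return problems
```

An empty return value means the file passed these basic checks.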
## Output Files

Each evaluation produces two files:

### JSON Output

`ragas_YYYYMMDD_HHMMSS.json`:

```json
{
  "aggregate_scores": {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75
  },
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "pass_rate": 0.83
    }
  },
  "detailed_results": [...]
}
```
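A quick summary can be printed from this JSON with a few lines of Python. This sketch assumes the key names shown in the example structure above.

```python
import json

def summarize(path: str) -> str:
    """One-line-per-metric summary of a RAGAS JSON output file."""
    with open(path) as f:
        report = json.load(f)
    lines = [f"{metric}: {score:.2f}"
             for metric, score in report["aggregate_scores"].items()]
    grounding = report.get("grounding_report", {}).get("summary", {})
    if grounding:
        lines.append(f"grounding pass rate: {grounding['pass_rate']:.0%}")
    return "\n".join(lines)
```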
### Markdown Report

`ragas_YYYYMMDD_HHMMSS.md`:

```markdown
# RAGAS Evaluation Report

## Summary

| Metric | Score |
|--------|-------|
| Context Precision | 0.72 |
| Context Recall | 0.68 |
| Faithfulness | 0.81 |
| Answer Similarity | 0.75 |

## Per-Question Results

...
```
## Interpreting Results

### High Scores

- High Precision + High Recall: retrieval is working well
- High Faithfulness: answers are grounded in the retrieved context
- High Similarity: answers match the expected answers
### Problem Patterns
| Pattern | Likely Cause | Solution |
|---|---|---|
| Low Precision, High Recall | Too many irrelevant results | Reduce search limit |
| High Precision, Low Recall | Missing relevant results | Increase BFS depth |
| Low Faithfulness | Answer from parametric knowledge | Check grounding mode |
| High Similarity, Low Faithfulness | LLM "knows" the answer | Use obscure test data |
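The patterns in this table can be turned into a small triage helper. The function below is a sketch with illustrative thresholds, not an aletheia feature; it maps score patterns to the likely causes listed above.

```python
def diagnose(precision: float, recall: float, faithfulness: float,
             threshold: float = 0.7) -> list[str]:
    """Map a score pattern to likely causes and suggested fixes."""
    hints = []
    if precision < threshold <= recall:
        hints.append("too many irrelevant results: try reducing the search limit")
    if recall < threshold <= precision:
        hints.append("missing relevant results: try increasing BFS depth")
    if faithfulness < threshold:
        hints.append("answer may come from parametric knowledge: check grounding mode")
    return hints

print(diagnose(precision=0.5, recall=0.9, faithfulness=0.9))
```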
### INSUFFICIENT_CONTEXT

If many questions return `INSUFFICIENT_CONTEXT`:

- Check that the data was ingested correctly
- Verify that search is returning results
- Consider whether the questions are too specific
## Best Practices

### 1. Use Domain-Specific Questions

Questions should require data that exists only in your graph:

```json
// Good - requires specific graph data
{"question": "What alias does the PKK use?"}

// Bad - LLM may know this from training
{"question": "What is the capital of France?"}
```
### 2. Curate Question Types
Different question types test different capabilities:
| Type | Example | Tests |
|---|---|---|
| Alias lookup | "What's another name for X?" | Node properties |
| Relationship | "Who sanctioned X?" | Edge traversal |
| Multi-hop | "What's the parent org of X's affiliate?" | Graph traversal |
### 3. Use Grounding Verification

Always use `--grounding-mode strict` for accurate metrics. This prevents parametric knowledge from inflating scores.
### 4. Compare Baselines

Use `--compare-baseline` to understand the impact of BFS traversal versus pure semantic search.
## Learn More
- RAGAS Metrics - Deep dive into metrics
- Grounding Verification - How grounding works
- Question Format - Question design