Running Evaluations¶
This guide covers how to evaluate your knowledge graph's retrieval quality using RAGAS.
What RAGAS Measures¶
RAGAS (Retrieval-Augmented Generation Assessment) evaluates:
| Metric | Measures | Good Score |
|---|---|---|
| Context Precision | Are retrieved results relevant? | > 0.7 |
| Context Recall | Is all needed info retrieved? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.7 |
| Answer Similarity | Does answer match expected? | > 0.7 |
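A quick way to use these targets is to flag any aggregate score at or below 0.7. This is an illustrative sketch (the helper name and the example score values are hypothetical, not tool output):

```python
# Sketch: check aggregate RAGAS scores against the 0.7 targets above.
THRESHOLD = 0.7

def below_threshold(scores: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Return the metrics that miss the target score."""
    return [metric for metric, score in scores.items() if score <= threshold]

# Hypothetical example scores
scores = {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75,
}
print(below_threshold(scores))  # only context_recall misses the target
```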
Running an Evaluation¶
Basic Evaluation¶
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions use_cases/terrorist_orgs/evaluation_questions.json \
  --output-dir output/
```
With Grounding Verification¶
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --grounding-mode strict
```
Grounding modes:
| Mode | Behavior |
|---|---|
| strict | Reject answers not grounded in evidence |
| lenient | Warn, but include all answers |
| off | Legacy mode; no verification |
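Conceptually, the three modes are a filter over generated answers. A minimal sketch of that behavior (the function and the `grounded` flag are illustrative, not the tool's internals):

```python
import warnings

def apply_grounding_mode(answers: list[dict], mode: str) -> list[dict]:
    """Illustrative filter: how each grounding mode treats ungrounded answers.

    Each answer is assumed to be a dict with a boolean 'grounded' flag
    (a hypothetical shape, not the tool's real data model).
    """
    if mode == "off":
        return answers                      # legacy: no verification at all
    if mode == "lenient":
        for a in answers:
            if not a["grounded"]:
                warnings.warn(f"answer {a['id']} is not grounded in evidence")
        return answers                      # warn, but keep everything
    if mode == "strict":
        return [a for a in answers if a["grounded"]]  # reject ungrounded
    raise ValueError(f"unknown grounding mode: {mode}")
```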
With Community Search¶
Include community context:
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --use-community-search
```
Baseline Comparison¶
Compare BFS + cosine retrieval against cosine-only retrieval:
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions questions.json \
  --compare-baseline
```
Question Format¶
Questions must be in JSON format:
```json
{
  "questions": [
    {
      "id": "q1",
      "question": "What alias is used for al-Shabaab?",
      "answer": "al-Hijra",
      "answer_aliases": ["Al-Hijra", "al Hijra"]
    },
    {
      "id": "q2",
      "question": "Is Hamas designated as an FTO?",
      "answer": "Yes, Hamas is designated as a Foreign Terrorist Organization by the US State Department."
    }
  ]
}
```
Fields:
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique question identifier |
| question | Yes | The question to ask |
| answer | Yes | Expected/gold answer |
| answer_aliases | No | Alternative correct answers |
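A small validator can catch format mistakes before running an evaluation. This is a sketch of the format described above, not the tool's own validation code:

```python
import json

REQUIRED = ("id", "question", "answer")

def validate_questions(raw: str) -> list[str]:
    """Validate a question file against the field table above.

    Returns a list of error messages; an empty list means the file is valid.
    """
    errors = []
    data = json.loads(raw)
    seen_ids = set()
    for i, q in enumerate(data.get("questions", [])):
        for field in REQUIRED:
            if field not in q:
                errors.append(f"question {i}: missing required field '{field}'")
        qid = q.get("id")
        if qid in seen_ids:
            errors.append(f"question {i}: duplicate id '{qid}'")
        seen_ids.add(qid)
    return errors
```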
Output Files¶
Evaluations produce two files:
JSON Output¶
ragas_YYYYMMDD_HHMMSS.json:
```json
{
  "aggregate_scores": {
    "context_precision": 0.72,
    "context_recall": 0.68,
    "faithfulness": 0.81,
    "answer_similarity": 0.75
  },
  "grounding_report": {
    "summary": {
      "total_verified": 70,
      "verification_passed": 58,
      "pass_rate": 0.83
    }
  },
  "detailed_results": [...]
}
```
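The JSON output is easy to post-process in scripts. A sketch that pulls out the headline numbers (the `summarize` helper is illustrative; note that `pass_rate` in the file is rounded, so it is recomputed here from the raw counts):

```python
import json

def summarize(report_json: str) -> str:
    """Extract headline numbers from a ragas_*.json output file."""
    report = json.loads(report_json)
    scores = report["aggregate_scores"]
    grounding = report["grounding_report"]["summary"]
    # pass_rate in the file is rounded; recompute it from the raw counts
    rate = grounding["verification_passed"] / grounding["total_verified"]
    return (f"faithfulness={scores['faithfulness']:.2f}, "
            f"grounding pass rate={rate:.0%}")
```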
Markdown Report¶
ragas_YYYYMMDD_HHMMSS.md:
```markdown
# RAGAS Evaluation Report

## Summary

| Metric | Score |
|--------|-------|
| Context Precision | 0.72 |
| Context Recall | 0.68 |
| Faithfulness | 0.81 |
| Answer Similarity | 0.75 |

## Per-Question Results

...
```
Interpreting Results¶
High Scores¶
- High Precision + High Recall: Retrieval is working well
- High Faithfulness: Answers are grounded in context
- High Similarity: Answers match the expected answers
Problem Patterns¶
| Pattern | Likely Cause | Solution |
|---|---|---|
| Low Precision, High Recall | Too many irrelevant results | Reduce search limit |
| High Precision, Low Recall | Missing relevant results | Increase BFS depth |
| Low Faithfulness | Answer from parametric knowledge | Check grounding mode |
| High Similarity, Low Faithfulness | LLM "knows" the answer | Use obscure test data |
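The first three rows of the table can be turned into a simple triage rule. A sketch using the 0.7 target as the cut-off (a rule of thumb from the scores table, not something the tool enforces):

```python
def diagnose(precision: float, recall: float, threshold: float = 0.7) -> str:
    """Map a context precision/recall pair onto the problem patterns above."""
    if precision <= threshold and recall > threshold:
        return "too many irrelevant results: reduce the search limit"
    if precision > threshold and recall <= threshold:
        return "missing relevant results: increase BFS depth"
    if precision <= threshold and recall <= threshold:
        return "both low: re-check ingestion and retrieval end to end"
    return "retrieval looks healthy"
```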
INSUFFICIENT_CONTEXT¶
If many questions return INSUFFICIENT_CONTEXT:
- Check that data was ingested correctly
- Verify search is returning results
- Consider if questions are too specific
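To see how widespread the problem is, you can count these answers in `detailed_results`. The sketch below assumes each result entry carries its generated answer under an `answer` key, which is a guess at the shape; adjust it to your actual output:

```python
def insufficient_context_rate(detailed_results: list[dict]) -> float:
    """Fraction of questions whose answer came back as INSUFFICIENT_CONTEXT.

    Assumes each entry exposes the generated answer under an 'answer' key
    (a hypothetical shape, not a documented schema).
    """
    if not detailed_results:
        return 0.0
    hits = sum(1 for r in detailed_results
               if r.get("answer") == "INSUFFICIENT_CONTEXT")
    return hits / len(detailed_results)
```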
Best Practices¶
1. Use Domain-Specific Questions¶
Questions should require information that exists only in your graph:
```
// Good - requires specific graph data
{"question": "What alias does the PKK use?"}

// Bad - LLM may know this from training
{"question": "What is the capital of France?"}
```
2. Curate Question Types¶
Different question types test different capabilities:
| Type | Example | Tests |
|---|---|---|
| Alias lookup | "What's another name for X?" | Node properties |
| Relationship | "Who sanctioned X?" | Edge traversal |
| Multi-hop | "What's the parent org of X's affiliate?" | Graph traversal |
3. Use Grounding Verification¶
Always use --grounding-mode strict for accurate metrics. This prevents parametric knowledge from inflating scores.
4. Compare Baselines¶
Use --compare-baseline to understand the impact of BFS traversal vs pure semantic search.
Learn More¶
- RAGAS Metrics - Deep dive into metrics
- Grounding Verification - How grounding works
- Question Format - Question design