RAGAS Metrics¶
RAGAS (Retrieval-Augmented Generation Assessment) provides standardized metrics for evaluating RAG systems. This page explains each metric in detail.
Overview¶
| Metric | Question Answered | Range |
|---|---|---|
| Context Precision | Are the top results relevant? | 0-1 |
| Context Recall | Did we find all needed info? | 0-1 |
| Faithfulness | Is the answer grounded? | 0-1 |
| Answer Similarity | Does answer match expected? | 0-1 |
Context Precision¶
What it measures: The proportion of relevant items in the retrieved context.
Intuition: If you retrieve 10 items and 7 are relevant, precision is 0.7.
Formula (matching the intuition above):

$$\text{Context Precision} = \frac{|\text{relevant retrieved items}|}{|\text{retrieved items}|}$$
Good score: > 0.7
Improving precision:

- Reduce the search limit to return fewer, more relevant results
- Use query filters to narrow the search scope
- Improve embedding quality
Without Reference

Aletheia uses `LLMContextPrecisionWithoutReference`, which doesn't require ground-truth context labels.
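The proportion above can be sketched as a toy function. In the actual RAGAS metric an LLM judge decides per-chunk relevance; here the set of relevant items is assumed to be given, purely for illustration:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved items that are relevant (toy version).

    The real metric delegates the relevance judgment to an LLM;
    here `relevant` is a pre-labeled set, which is an assumption
    made for this sketch.
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


# 10 retrieved items, 7 of them relevant -> 0.7, as in the intuition above
docs = [f"d{i}" for i in range(10)]
score = context_precision(docs, relevant={"d0", "d1", "d2", "d3", "d4", "d5", "d6"})
```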
Context Recall¶
What it measures: The proportion of required information that was retrieved.
Intuition: If the answer requires 5 facts and you retrieved 4, recall is 0.8.
Formula (matching the intuition above):

$$\text{Context Recall} = \frac{|\text{required facts found in retrieved context}|}{|\text{required facts}|}$$
Good score: > 0.7
Improving recall:

- Increase BFS traversal depth
- Increase the search limit
- Use community search for broader coverage
- Ensure data was properly ingested
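The recall proportion can likewise be sketched as a toy function. The real RAGAS metric uses an LLM to decide whether each ground-truth claim is attributable to the retrieved context; here that check is approximated by a substring match, which is an assumption made only to keep the sketch self-contained:

```python
def context_recall(required_facts: set[str], retrieved_context: str) -> float:
    """Fraction of required facts present in the retrieved context (toy version).

    A naive substring check stands in for the LLM attribution
    judgment used by the real metric.
    """
    if not required_facts:
        return 1.0  # nothing was required, so nothing is missing
    found = sum(1 for fact in required_facts if fact in retrieved_context)
    return found / len(required_facts)


# The answer needs 5 facts; the context contains 4 of them -> 0.8
facts = {"born 1912", "studied at Cambridge", "proposed the test",
         "worked at Bletchley", "died 1954"}
context = "born 1912; studied at Cambridge; proposed the test; worked at Bletchley"
score = context_recall(facts, context)
```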
Faithfulness¶
What it measures: Whether the generated answer is grounded in the retrieved context.
Intuition: Does the answer only use information from the context, or does it make things up?
Scoring:

- 1.0 = Every claim in the answer is supported by context
- 0.5 = Half the claims are supported
- 0.0 = No claims are supported
Good score: > 0.7
Low faithfulness causes:

- LLM using parametric knowledge
- Hallucinated details
- Over-extrapolation from context
Improving faithfulness:

- Use `--grounding-mode strict`
- Use domain-specific data LLMs haven't seen
- Strengthen grounding prompts
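The scoring rule above reduces to a ratio of supported claims. In RAGAS the claims are extracted from the answer and verified against the context by an LLM judge; this sketch assumes those per-claim verdicts are already available and just aggregates them:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context.

    `claim_verdicts` stands in for the LLM judge's output: one
    boolean per claim extracted from the answer. That input is an
    assumption of this sketch, not part of the RAGAS API.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Two of four claims supported -> 0.5, matching the scoring scale above
score = faithfulness_score([True, True, False, False])
```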
Answer Similarity¶
What it measures: Semantic similarity between generated and expected answer.
Intuition: Does the answer convey the same meaning as the gold answer?
Scoring: Cosine similarity of embeddings, normalized to 0-1.
Good score: > 0.7
Interpreting scores:

- 0.9+ = Nearly identical meaning
- 0.7-0.9 = Same core information
- 0.5-0.7 = Partially correct
- < 0.5 = Different or wrong answer
Metric Interactions¶
High Precision + Low Recall¶
Symptom: Few but relevant results.
Cause: Search is too narrow.
Solution: Increase limit, add BFS traversal.
Low Precision + High Recall¶
Symptom: Many results, lots of noise.
Cause: Search is too broad.
Solution: Decrease limit, use query filters.
High Similarity + Low Faithfulness¶
Symptom: Correct answer but not from context.
Diagnosis: LLM answering from parametric knowledge.
Solution: Use grounding verification, use obscure test data.
Low Faithfulness + Low Recall¶
Symptom: LLM making up answers because context is insufficient.
Cause: Retrieval not returning relevant information.
Solution: Fix retrieval first, then re-evaluate.
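The four interaction patterns above can be collapsed into a small diagnostic helper. The 0.7 threshold comes from the "good score" guidance earlier on this page; the messages and the function itself are illustrative, not part of any tool:

```python
GOOD = 0.7  # the "good score" cutoff used throughout this page


def diagnose(precision: float, recall: float,
             faithfulness: float, similarity: float) -> list[str]:
    """Map metric combinations to likely causes (illustrative sketch)."""
    notes = []
    if precision > GOOD and recall < GOOD:
        notes.append("search too narrow: increase limit, add BFS traversal")
    if precision < GOOD and recall > GOOD:
        notes.append("search too broad: decrease limit, use query filters")
    if similarity > GOOD and faithfulness < GOOD:
        notes.append("parametric knowledge: enable grounding verification")
    if faithfulness < GOOD and recall < GOOD:
        notes.append("insufficient context: fix retrieval first, then re-evaluate")
    return notes


# High precision but low recall flags the "search too narrow" pattern
issues = diagnose(precision=0.9, recall=0.4, faithfulness=0.8, similarity=0.8)
```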
Baseline Comparison¶
The `--compare-baseline` flag runs two evaluations:
- BFS + Cosine: Graph traversal combined with semantic search
- Cosine Only: Pure semantic search
This helps quantify the value of graph structure in your domain.
Expected results:

- Multi-hop questions: BFS should significantly outperform
- Simple lookups: Similar performance
- Entity relationships: BFS should help
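Quantifying "the value of graph structure" amounts to differencing the two runs metric by metric. The result dictionaries below are hypothetical output shapes, assumed only for this sketch:

```python
def baseline_delta(bfs_cosine: dict[str, float],
                   cosine_only: dict[str, float]) -> dict[str, float]:
    """Per-metric improvement of BFS + Cosine over Cosine Only.

    Positive values mean the graph traversal helped; the dict shape
    is an assumption, not the tool's actual output format.
    """
    return {metric: round(bfs_cosine[metric] - cosine_only[metric], 3)
            for metric in bfs_cosine}


# Hypothetical scores: BFS helps recall on multi-hop questions
bfs = {"context_precision": 0.82, "context_recall": 0.88, "faithfulness": 0.90}
cos = {"context_precision": 0.80, "context_recall": 0.61, "faithfulness": 0.85}
delta = baseline_delta(bfs, cos)
```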
Learn More¶
- Running Evaluations - How to run evaluations
- Grounding Verification - Prevent parametric knowledge
- Question Format - Design effective questions