RAGAS Metrics¶
RAGAS (Retrieval-Augmented Generation Assessment) provides standardized metrics for evaluating RAG systems. This page explains each metric in detail.
Overview¶
| Metric | Question Answered | Range |
|---|---|---|
| Context Precision | Are the top results relevant? | 0-1 |
| Context Recall | Did we find all needed info? | 0-1 |
| Faithfulness | Is the answer grounded? | 0-1 |
| Answer Similarity | Does answer match expected? | 0-1 |
Context Precision¶
What it measures: The proportion of relevant items in the retrieved context.
Intuition: If you retrieve 10 items and 7 are relevant, precision is 0.7.
Formula (matching the intuition above):

$$\text{Context Precision} = \frac{|\text{relevant retrieved items}|}{|\text{retrieved items}|}$$
Good score: > 0.7
Improving precision:

- Reduce the search limit to return fewer, more relevant results
- Use query filters to narrow the search scope
- Improve embedding quality
Without Reference

Aletheia uses `LLMContextPrecisionWithoutReference`, which doesn't require ground-truth context labels.
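The proportion above can be sketched as a toy function. In the actual RAGAS metric an LLM judge decides per-chunk relevance; here the set of relevant items is assumed to be given, purely for illustration:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved items that are relevant (toy version).

    The real metric delegates the relevance judgment to an LLM;
    here `relevant` is a pre-labeled set, which is an assumption
    made for this sketch.
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


# 10 retrieved items, 7 of them relevant -> 0.7, as in the intuition above
docs = [f"d{i}" for i in range(10)]
score = context_precision(docs, relevant={"d0", "d1", "d2", "d3", "d4", "d5", "d6"})
```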
Context Recall¶
What it measures: The proportion of required information that was retrieved.
Intuition: If the answer requires 5 facts and you retrieved 4, recall is 0.8.
Formula (matching the intuition above):

$$\text{Context Recall} = \frac{|\text{required facts found in retrieved context}|}{|\text{required facts}|}$$
Good score: > 0.7
Improving recall:

- Increase BFS traversal depth
- Increase the search limit
- Use community search for broader coverage
- Ensure data was properly ingested
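The recall proportion can likewise be sketched as a toy function. The real RAGAS metric uses an LLM to decide whether each ground-truth claim is attributable to the retrieved context; here that check is approximated by a substring match, which is an assumption made only to keep the sketch self-contained:

```python
def context_recall(required_facts: set[str], retrieved_context: str) -> float:
    """Fraction of required facts present in the retrieved context (toy version).

    A naive substring check stands in for the LLM attribution
    judgment used by the real metric.
    """
    if not required_facts:
        return 1.0  # nothing was required, so nothing is missing
    found = sum(1 for fact in required_facts if fact in retrieved_context)
    return found / len(required_facts)


# The answer needs 5 facts; the context contains 4 of them -> 0.8
facts = {"born 1912", "studied at Cambridge", "proposed the test",
         "worked at Bletchley", "died 1954"}
context = "born 1912; studied at Cambridge; proposed the test; worked at Bletchley"
score = context_recall(facts, context)
```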
Faithfulness¶
What it measures: Whether the generated answer is grounded in the retrieved context.
Intuition: Does the answer only use information from the context, or does it make things up?
Scoring:

- 1.0 = Every claim in the answer is supported by context
- 0.5 = Half the claims are supported
- 0.0 = No claims are supported
Good score: > 0.7
Low faithfulness causes:

- LLM using parametric knowledge
- Hallucinated details
- Over-extrapolation from context
Improving faithfulness:

- Use `--grounding-mode strict`
- Use domain-specific data LLMs haven't seen
- Strengthen grounding prompts
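The scoring rule above reduces to a ratio of supported claims. In RAGAS the claims are extracted from the answer and verified against the context by an LLM judge; this sketch assumes those per-claim verdicts are already available and just aggregates them:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context.

    `claim_verdicts` stands in for the LLM judge's output: one
    boolean per claim extracted from the answer. That input is an
    assumption of this sketch, not part of the RAGAS API.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Two of four claims supported -> 0.5, matching the scoring scale above
score = faithfulness_score([True, True, False, False])
```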
Answer Similarity¶
What it measures: Semantic similarity between generated and expected answer.
Intuition: Does the answer convey the same meaning as the gold answer?
Scoring: Cosine similarity of embeddings, normalized to 0-1.
Good score: > 0.7
Interpreting scores:

- 0.9+ = Nearly identical meaning
- 0.7-0.9 = Same core information
- 0.5-0.7 = Partially correct
- < 0.5 = Different or wrong answer
Metric Interactions¶
High Precision + Low Recall¶
Symptom: Few but relevant results.
Cause: Search is too narrow.
Solution: Increase limit, add BFS traversal.
Low Precision + High Recall¶
Symptom: Many results, lots of noise.
Cause: Search is too broad.
Solution: Decrease limit, use query filters.
High Similarity + Low Faithfulness¶
Symptom: Correct answer but not from context.
Diagnosis: LLM answering from parametric knowledge.
Solution: Use grounding verification, use obscure test data.
Low Faithfulness + Low Recall¶
Symptom: LLM making up answers because context is insufficient.
Cause: Retrieval not returning relevant information.
Solution: Fix retrieval first, then re-evaluate.
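The four interaction patterns above can be collapsed into a small diagnostic helper. The 0.7 threshold comes from the "good score" guidance earlier on this page; the messages and the function itself are illustrative, not part of any tool:

```python
GOOD = 0.7  # the "good score" cutoff used throughout this page


def diagnose(precision: float, recall: float,
             faithfulness: float, similarity: float) -> list[str]:
    """Map metric combinations to likely causes (illustrative sketch)."""
    notes = []
    if precision > GOOD and recall < GOOD:
        notes.append("search too narrow: increase limit, add BFS traversal")
    if precision < GOOD and recall > GOOD:
        notes.append("search too broad: decrease limit, use query filters")
    if similarity > GOOD and faithfulness < GOOD:
        notes.append("parametric knowledge: enable grounding verification")
    if faithfulness < GOOD and recall < GOOD:
        notes.append("insufficient context: fix retrieval first, then re-evaluate")
    return notes


# High precision but low recall flags the "search too narrow" pattern
issues = diagnose(precision=0.9, recall=0.4, faithfulness=0.8, similarity=0.8)
```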
Baseline Comparison¶
The `--compare-baseline` flag runs two evaluations:
- BFS + Cosine: Graph traversal combined with semantic search
- Cosine Only: Pure semantic search
This helps quantify the value of graph structure in your domain.
Expected results:

- Multi-hop questions: BFS should significantly outperform
- Simple lookups: Similar performance
- Entity relationships: BFS should help
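Quantifying "the value of graph structure" amounts to differencing the two runs metric by metric. The result dictionaries below are hypothetical output shapes, assumed only for this sketch:

```python
def baseline_delta(bfs_cosine: dict[str, float],
                   cosine_only: dict[str, float]) -> dict[str, float]:
    """Per-metric improvement of BFS + Cosine over Cosine Only.

    Positive values mean the graph traversal helped; the dict shape
    is an assumption, not the tool's actual output format.
    """
    return {metric: round(bfs_cosine[metric] - cosine_only[metric], 3)
            for metric in bfs_cosine}


# Hypothetical scores: BFS helps recall on multi-hop questions
bfs = {"context_precision": 0.82, "context_recall": 0.88, "faithfulness": 0.90}
cos = {"context_precision": 0.80, "context_recall": 0.61, "faithfulness": 0.85}
delta = baseline_delta(bfs, cos)
```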
Learn More¶
- Running Evaluations - How to run evaluations
- Grounding Verification - Prevent parametric knowledge
- Question Format - Design effective questions