Avoiding Parametric Knowledge¶
Parametric knowledge contamination occurs when an LLM answers from its training data rather than retrieved context. This guide explains how to detect and prevent it.
The Problem¶
```
Question: What is the capital of France?
Retrieved Context: [Information about French geography]
LLM Answer: Paris
```
Did the LLM answer from context or from memory?
For common knowledge, you can't tell. This makes evaluation unreliable.
Detection Signs¶
1. High Answer Similarity + Low Faithfulness¶
| Metric | Score | Interpretation |
|---|---|---|
| Answer Similarity | 0.95 | Correct answer |
| Faithfulness | 0.40 | Not grounded in context |
The LLM "knew" the answer but didn't use the context.
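The pattern in the table above is easy to check programmatically. A minimal sketch, where `flag_contamination` and both thresholds are illustrative choices, not part of any evaluation library:

```python
def flag_contamination(answer_similarity: float, faithfulness: float,
                       sim_threshold: float = 0.8,
                       faith_threshold: float = 0.5) -> bool:
    """High answer similarity combined with low faithfulness suggests the
    answer came from parametric knowledge, not the retrieved context."""
    return answer_similarity >= sim_threshold and faithfulness <= faith_threshold

# The scores from the table above trip the flag:
print(flag_contamination(0.95, 0.40))  # True
```

Tune the thresholds to your own metric distributions before relying on the flag.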
2. Correct Answers with No Context¶
If retrieval returns nothing but the answers are still correct, the LLM is drawing on parametric knowledge.
3. Grounding Rejections¶
In strict mode, look for:

- `uncited_entities` - Answer mentions things not in evidence
- `answer_without_citations` - No evidence cited
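Tallying how often each rejection reason fires across a run makes the pattern visible. A sketch assuming a hypothetical per-question result format (the field names are illustrative):

```python
from collections import Counter

# Illustrative grounding results; each entry records its rejection reasons.
results = [
    {"question_id": "q1", "rejections": ["uncited_entities"]},
    {"question_id": "q2", "rejections": []},
    {"question_id": "q3", "rejections": ["answer_without_citations"]},
]

# Count occurrences of each rejection reason across the run.
reason_counts = Counter(r for item in results for r in item["rejections"])
print(reason_counts)
```

A spike in `uncited_entities` is a typical first symptom of the LLM answering from memory.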
Prevention Strategies¶
1. Use Domain-Specific Data¶
Choose data the LLM hasn't seen:
| Good | Bad |
|---|---|
| Internal company documents | Wikipedia content |
| Recent news (post-training cutoff) | Historical facts |
| Obscure sanctions data | Common knowledge |
| Proprietary databases | Public datasets |
2. Use Grounding Verification¶
Strict mode rejects answers not grounded in evidence.
3. Generate Synthetic Questions¶
Create questions from your graph that can only be answered with your specific data:
```python
# Extract facts from your graph
facts = query_graph_facts(graphiti, group_id)

# Generate questions
for fact in facts:
    question = f"What is the relationship between {fact.source} and {fact.target}?"
    # This question requires your specific graph data
```
4. Use Recent Data¶
Data from after the LLM's training cutoff:
```bash
# Download latest sanctions data (updated weekly)
curl -o entities.ftm.json \
  "https://data.opensanctions.org/datasets/latest/..."
```
5. Use Obscure Data¶
Even historical data can work if it's obscure:
| Likely Known | Likely Unknown |
|---|---|
| Major terrorist groups | Obscure shell companies |
| Famous sanctions | Specific sanction programs |
| Country capitals | Entity aliases |
Testing for Contamination¶
Method 1: No-Context Test¶
Run evaluation with empty context:
```python
# In evaluation code
context = ""  # Empty context
answer = llm.generate(question, context)

# If answers are correct, the LLM is using parametric knowledge
```
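This check is most useful run over the whole question set. A batch sketch, where `generate` and `is_correct` stand in for your real LLM call and scoring function:

```python
def no_context_rate(questions, generate, is_correct):
    """Fraction of questions answered correctly with an empty context.
    A high rate means heavy parametric-knowledge contamination."""
    correct = sum(
        1 for q in questions
        if is_correct(generate(q["question"], context=""), q["expected"])
    )
    return correct / len(questions) if questions else 0.0

# Toy stand-ins to show the shape of the check:
questions = [
    {"question": "capital of France?", "expected": "Paris"},
    {"question": "alias of entity X-42?", "expected": "unknown-alias"},
]
fake_generate = lambda q, context: "Paris" if "France" in q else "no idea"
fake_is_correct = lambda answer, expected: answer == expected

print(no_context_rate(questions, fake_generate, fake_is_correct))  # 0.5
```

Here the common-knowledge question is answered correctly with no context at all, while the obscure one is not, which is exactly the split this guide recommends exploiting.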
Method 2: Wrong-Context Test¶
Provide deliberately wrong context:
```python
# Swap context between questions
context_for_q1 = retrieve(q2)
answer = llm.generate(q1, context_for_q1)

# Correct answer = parametric knowledge
```
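To run the wrong-context test in batch, rotating the context list by one guarantees every question is paired with a different question's context. A sketch (names here are illustrative):

```python
def swap_contexts(questions, contexts):
    """Pair each question with the next question's context, so no
    question keeps its own retrieved context."""
    return list(zip(questions, contexts[1:] + contexts[:1]))

pairs = swap_contexts(["q1", "q2", "q3"], ["c1", "c2", "c3"])
print(pairs)  # [('q1', 'c2'), ('q2', 'c3'), ('q3', 'c1')]
```

Any question still answered correctly under its swapped context was almost certainly answered from memory.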
Method 3: Compare Grounding Modes¶
```bash
# Strict mode
aletheia evaluate-ragas --grounding-mode strict ...
# Result: Answer Similarity = 0.60

# Off mode
aletheia evaluate-ragas --grounding-mode off ...
# Result: Answer Similarity = 0.90

# Large gap = parametric knowledge contamination
```
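The comparison reduces to a single number. A sketch using the scores from the two runs above (the 0.2 threshold is illustrative, not a tool default):

```python
strict_score = 0.60  # Answer Similarity under strict grounding
off_score = 0.90     # Answer Similarity with grounding off

# The gap is the similarity the model achieves without having to
# ground its answers in the retrieved evidence.
gap = off_score - strict_score
contaminated = gap > 0.2
print(f"gap={gap:.2f}, contaminated={contaminated}")
```

A near-zero gap means grounding costs the model little, i.e. its answers were already coming from the context.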
The terrorist_orgs Dataset¶
The terrorist_orgs dataset is designed to minimize contamination:
- Specific aliases - LLMs don't know all aliases
- Program IDs - `US-FTO219` vs general knowledge
- Cross-jurisdiction data - UK proscriptions are less widely known
- Recent designations - made after the training cutoff
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions use_cases/terrorist_orgs/evaluation_questions.json \
  --grounding-mode strict
```
Best Practices Summary¶
- Always use grounding verification in production evaluations
- Start with strict mode to establish baseline
- Prefer obscure/recent data for evaluation
- Generate questions from your graph for guaranteed coverage
- Monitor grounding pass rates - declining rates suggest contamination
- Curate question sets to remove obviously-known questions
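The pass-rate monitoring practice above can be automated with a simple trend check. A sketch, where the run data and the window size are illustrative:

```python
# Grounding pass rates from successive evaluation runs, most recent last.
pass_rates = [0.92, 0.90, 0.85, 0.78]

def is_declining(rates, window: int = 3) -> bool:
    """True if the last `window` pass rates are strictly decreasing,
    a signal worth investigating for creeping contamination."""
    recent = rates[-window:]
    return all(b < a for a, b in zip(recent, recent[1:]))

print(is_declining(pass_rates))  # True
```

Wire a check like this into your evaluation pipeline so a sustained decline triggers a review of the question set.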
Learn More¶
- Grounding Verification - How grounding works
- Question Format - Question design
- RAGAS Metrics - Understanding metrics