Avoiding Parametric Knowledge¶
Parametric knowledge contamination occurs when an LLM answers from its training data rather than retrieved context. This guide explains how to detect and prevent it.
The Problem¶
```
Question: What is the capital of France?
Retrieved Context: [Information about French geography]
LLM Answer: Paris
```
Did the LLM answer from context or from memory?
For common knowledge, you can't tell. This makes evaluation unreliable.
Detection Signs¶
1. High Answer Similarity + Low Faithfulness¶
| Metric | Score | Interpretation |
|---|---|---|
| Answer Similarity | 0.95 | Correct answer |
| Faithfulness | 0.40 | Not grounded in context |
The LLM "knew" the answer but didn't use the context.
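The pattern in the table above is easy to check programmatically. A minimal sketch, where `flag_contamination` and both thresholds are illustrative choices, not part of any evaluation library:

```python
def flag_contamination(answer_similarity: float, faithfulness: float,
                       sim_threshold: float = 0.8,
                       faith_threshold: float = 0.5) -> bool:
    """High answer similarity combined with low faithfulness suggests the
    answer came from parametric knowledge, not the retrieved context."""
    return answer_similarity >= sim_threshold and faithfulness <= faith_threshold

# The scores from the table above trip the flag:
print(flag_contamination(0.95, 0.40))  # True
```

Tune the thresholds to your own metric distributions before relying on the flag.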
2. Correct Answers with No Context¶
If retrieval returns nothing but the answers are still correct, the LLM is drawing on parametric knowledge.
3. Grounding Rejections¶
In strict mode, look for:

- `uncited_entities` - Answer mentions things not in evidence
- `answer_without_citations` - No evidence cited
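Tallying how often each rejection reason fires across a run makes the pattern visible. A sketch assuming a hypothetical per-question result format (the field names are illustrative):

```python
from collections import Counter

# Illustrative grounding results; each entry records its rejection reasons.
results = [
    {"question_id": "q1", "rejections": ["uncited_entities"]},
    {"question_id": "q2", "rejections": []},
    {"question_id": "q3", "rejections": ["answer_without_citations"]},
]

# Count occurrences of each rejection reason across the run.
reason_counts = Counter(r for item in results for r in item["rejections"])
print(reason_counts)
```

A spike in `uncited_entities` is a typical first symptom of the LLM answering from memory.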
Prevention Strategies¶
1. Use Domain-Specific Data¶
Choose data the LLM hasn't seen:
| Good | Bad |
|---|---|
| Internal company documents | Wikipedia content |
| Recent news (post-training cutoff) | Historical facts |
| Obscure sanctions data | Common knowledge |
| Proprietary databases | Public datasets |
2. Use Grounding Verification¶
Strict mode rejects answers not grounded in evidence.
3. Generate Synthetic Questions¶
Create questions from your graph that can only be answered with your specific data:
```python
# Extract facts from your graph
facts = query_graph_facts(graphiti, group_id)

# Generate questions
for fact in facts:
    question = f"What is the relationship between {fact.source} and {fact.target}?"
    # This question requires your specific graph data
```
4. Use Recent Data¶
Data from after the LLM's training cutoff:
```bash
# Download latest sanctions data (updated weekly)
curl -o entities.ftm.json \
  "https://data.opensanctions.org/datasets/latest/..."
```
5. Use Obscure Data¶
Even historical data can work if it's obscure:
| Likely Known | Likely Unknown |
|---|---|
| Major terrorist groups | Obscure shell companies |
| Famous sanctions | Specific sanction programs |
| Country capitals | Entity aliases |
Testing for Contamination¶
Method 1: No-Context Test¶
Run evaluation with empty context:
```python
# In evaluation code
context = ""  # Empty context
answer = llm.generate(question, context)

# If answers are correct, the LLM is using parametric knowledge
```
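This check is most useful run over the whole question set. A batch sketch, where `generate` and `is_correct` stand in for your real LLM call and scoring function:

```python
def no_context_rate(questions, generate, is_correct):
    """Fraction of questions answered correctly with an empty context.
    A high rate means heavy parametric-knowledge contamination."""
    correct = sum(
        1 for q in questions
        if is_correct(generate(q["question"], context=""), q["expected"])
    )
    return correct / len(questions) if questions else 0.0

# Toy stand-ins to show the shape of the check:
questions = [
    {"question": "capital of France?", "expected": "Paris"},
    {"question": "alias of entity X-42?", "expected": "unknown-alias"},
]
fake_generate = lambda q, context: "Paris" if "France" in q else "no idea"
fake_is_correct = lambda answer, expected: answer == expected

print(no_context_rate(questions, fake_generate, fake_is_correct))  # 0.5
```

Here the common-knowledge question is answered correctly with no context at all, while the obscure one is not, which is exactly the split this guide recommends exploiting.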
Method 2: Wrong-Context Test¶
Provide deliberately wrong context:
```python
# Swap context between questions
context_for_q1 = retrieve(q2)
answer = llm.generate(q1, context_for_q1)

# Correct answer = parametric knowledge
```
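To run the wrong-context test in batch, rotating the context list by one guarantees every question is paired with a different question's context. A sketch (names here are illustrative):

```python
def swap_contexts(questions, contexts):
    """Pair each question with the next question's context, so no
    question keeps its own retrieved context."""
    return list(zip(questions, contexts[1:] + contexts[:1]))

pairs = swap_contexts(["q1", "q2", "q3"], ["c1", "c2", "c3"])
print(pairs)  # [('q1', 'c2'), ('q2', 'c3'), ('q3', 'c1')]
```

Any question still answered correctly under its swapped context was almost certainly answered from memory.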
Method 3: Compare Grounding Modes¶
```bash
# Strict mode
aletheia evaluate-ragas --grounding-mode strict ...
# Result: Answer Similarity = 0.60

# Off mode
aletheia evaluate-ragas --grounding-mode off ...
# Result: Answer Similarity = 0.90

# Large gap = parametric knowledge contamination
```
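The comparison reduces to a single number. A sketch using the scores from the two runs above (the 0.2 threshold is illustrative, not a tool default):

```python
strict_score = 0.60  # Answer Similarity under strict grounding
off_score = 0.90     # Answer Similarity with grounding off

# The gap is the similarity the model achieves without having to
# ground its answers in the retrieved evidence.
gap = off_score - strict_score
contaminated = gap > 0.2
print(f"gap={gap:.2f}, contaminated={contaminated}")
```

A near-zero gap means grounding costs the model little, i.e. its answers were already coming from the context.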
The terrorist_orgs Dataset¶
The terrorist_orgs dataset is designed to minimize contamination:
- Specific aliases - LLMs don't know all aliases
- Program IDs - `US-FTO219` vs general knowledge
- Cross-jurisdiction data - UK proscriptions are less widely known
- Recent designations - made after the training cutoff
```bash
aletheia evaluate-ragas \
  --knowledge-graph terrorist_orgs \
  --questions use_cases/terrorist_orgs/evaluation_questions.json \
  --grounding-mode strict
```
Best Practices Summary¶
- Always use grounding verification in production evaluations
- Start with strict mode to establish baseline
- Prefer obscure/recent data for evaluation
- Generate questions from your graph for guaranteed coverage
- Monitor grounding pass rates - declining rates suggest contamination
- Curate question sets to remove obviously-known questions
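The pass-rate monitoring practice above can be automated with a simple trend check. A sketch, where the run data and the window size are illustrative:

```python
# Grounding pass rates from successive evaluation runs, most recent last.
pass_rates = [0.92, 0.90, 0.85, 0.78]

def is_declining(rates, window: int = 3) -> bool:
    """True if the last `window` pass rates are strictly decreasing,
    a signal worth investigating for creeping contamination."""
    recent = rates[-window:]
    return all(b < a for a, b in zip(recent, recent[1:]))

print(is_declining(pass_rates))  # True
```

Wire a check like this into your evaluation pipeline so a sustained decline triggers a review of the question set.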
Learn More¶
- Grounding Verification - How grounding works
- Question Format - Question design
- RAGAS Metrics - Understanding metrics