Schema Inference: Choosing the Right Strategy¶
This FAQ explains why schema inference matters, what options are available, and how to choose the right strategy for your use case.
Why Schema Inference?¶
The Problem¶
When building a knowledge graph, Graphiti's LLM extracts entities and relationships from text. Without guidance, extraction is unconstrained:
Input: "Hamas is a terrorist organization designated by the US State Department"
Without schema:
- Entity: "Hamas" (type: ???)
- Entity: "US State Department" (type: ???)
- Relationship: ??? → ???
The LLM must decide:
- What entity types to create (Person? Organization? TerroristGroup? GovernmentAgency?)
- What relationship types to use (DESIGNATED_BY? SANCTIONS? IS_A?)
- What properties each entity should have
The Consequences¶
Without schema guidance, you get:
| Problem | Example |
|---|---|
| Inconsistent types | Same concept labeled "Person", "Individual", "Human" |
| Redundant relationships | "LOCATED_IN", "IS_LOCATED_IN", "BASED_IN" for same meaning |
| Missing structure | Important relationships not extracted |
| Semantic drift | Types evolve unpredictably across documents |
In one real evaluation, unconstrained extraction produced 579 unique relationship types with massive overlap.
The Solution: Schema Inference¶
Schema inference provides the LLM with a vocabulary of entity types, relationship types, and properties to use during extraction:
# With schema guidance
entity_types = {"Organization": Organization, "Sanction": Sanction}
edge_types = {"SANCTIONS": Sanctions, "HAS_ALIAS": HasAlias}
# Extraction becomes consistent
await graphiti.add_episode(
    ...,
    entity_types=entity_types,
    edge_types=edge_types,
)
Available Schema Modes¶
Aletheia provides 6 distinct schema modes (plus an inference alias for llm), each with different tradeoffs:
| Mode | Primary Source | LLM Role | Ontology Required | Best For |
|---|---|---|---|---|
| none | Graphiti defaults | Full discretion | No | Quick prototyping |
| llm | LLM inference | Primary | No | Unknown data structure |
| inference | (Alias for llm) | Primary | No | Unknown data structure |
| ontology | Ontology file | None | Yes | Strict formal domains |
| hybrid | LLM + ontology validation | Primary + validation | Yes | Balanced approach |
| graph-hybrid | LLM + semantic alignment | Primary + alignment | Yes | FTM data (recommended) |
| ontology-first | Ontology | Enhancement only | Yes | Well-defined domains |
Mode Details¶
none - No Schema¶
What happens: Graphiti uses its default generic schema, where every node is a plain Entity and every edge is RELATED_TO.
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode none
Pros:
- Zero setup time
- Works immediately

Cons:
- No type consistency
- Poor retrieval precision
- Relationship types unpredictable
Use when: Quick prototyping or data exploration, where graph quality doesn't matter.
llm / inference - LLM-Inferred Schema¶
What happens: Two-stage LLM analysis:
- Stage 1 (Domain Analysis): LLM analyzes sample data and generates a domain-specific extraction prompt
- Stage 2 (Schema Extraction): Uses generated prompt to extract structured schema
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode llm
Output: Generated schema saved to schemas/<graph_name>/schema_v1.py
📊 Stage 1: Domain Analysis
✓ Domain: Sanctions and terrorist organization data...
✓ Entities identified: 8
✓ Relationships identified: 5
✓ Extraction prompt saved: prompts/dynamic/my_graph/
📋 Stage 2: Schema Extraction
✓ Entity types extracted: 8
✓ Relationship types extracted: 5
Pros:
- Discovers schema from data automatically
- No ontology required
- Good for unknown data structures

Cons:
- Limited by sample data coverage
- May miss important concepts not in samples
- Schema quality depends on the LLM
Use when: You don't have an ontology and want automatic schema discovery.
ontology - Strict Ontology Adherence¶
What happens: Schema extracted directly from ontology file. LLM has no input.
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode ontology
Pros:
- Complete control over schema
- No LLM variability
- Matches formal domain models exactly

Cons:
- Requires well-defined ontology
- No flexibility for unexpected data
- May reject valid entities not in ontology
Use when: You have a formal ontology (OWL/TTL) and need strict adherence.
hybrid - LLM + Ontology Validation¶
What happens:
- LLM infers schema from sample data
- Ontology validates and corrects entity names (fuzzy matching)
- Ontology enriches entities with additional properties
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode hybrid
Pros:
- LLM discovers what's in the data
- Ontology provides consistency
- Best of both worlds

Cons:
- String-based matching (not semantic)
- May miss semantic equivalents ("Airport" vs "Aerodrome")
Use when: You have an ontology but want LLM flexibility.
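The string-matching limitation is easy to demonstrate with the standard library's difflib (a standalone illustration, not Aletheia's actual matcher): spelling variants of the same name score high, but semantically equivalent terms with different surface forms do not.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Spelling/casing variants of the same name score high...
near = fuzzy_score("Is_Located_In", "IS_LOCATED_IN")

# ...but semantic equivalents with different surface forms score low,
# so a string-based matcher misses the "Airport"/"Aerodrome" pair.
far = fuzzy_score("Airport", "Aerodrome")
```

This is exactly the gap that graph-hybrid's embedding-based alignment closes.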
graph-hybrid - LLM + Semantic Graph Alignment (Recommended)¶
What happens: Three-phase process:
- Phase 1 (LLM-First): Unbiased LLM inference without ontology guidance
- Phase 2 (Semantic Alignment): Uses Graphiti search to align inferred types with ontology via embeddings
- Phase 3 (Property Enrichment): Adds ontology properties to aligned entities
# Step 1: Load ontology into graph (once)
aletheia build-ontology-graph \
--use-case my_case \
--knowledge-graph my_ontology
# Step 2: Build with graph-hybrid
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode graph-hybrid \
--ontology-graph my_ontology
Output:
🔮 Graph-Hybrid Mode: LLM-first + Semantic Alignment
📊 Phase 1: LLM-First Inference (Unbiased)
✓ Inferred 12 entities, 8 relationships
🔍 Phase 2: Semantic Alignment via Knowledge Graph
✓ Airport → Aerodrome General (89%)
✓ Person → Person (95%)
⚠️ CustomType → (kept as-is)
✓ Aligned 10 concepts (confidence threshold: 70%)
📚 Phase 3: Property Enrichment (Data-Driven)
✓ Enriched 8 entities:
- Person: +12 properties
- Organization: +8 properties
Key advantage: Semantic alignment bridges terminology gaps:
- "Airport" matches "Aerodrome General" via embedding similarity
- Multilingual support (Spanish "Persona" matches "Person")
- Handles synonyms and domain variations

Pros:
- LLM discovers naturally
- Semantic alignment handles terminology gaps
- Ontology enriches with properties
- Alignment report for transparency

Cons:
- Requires ontology loaded to graph
- Alignment may fail for truly novel concepts
- More complex setup
Use when: FTM data, well-defined ontology, need flexibility + consistency.
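Phase 2's alignment can be pictured as nearest-neighbor search over embeddings with a confidence cutoff. The sketch below uses toy 3-dimensional vectors and hypothetical type names; a real run embeds type names with the graph's embedding model, not hand-written coordinates:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: semantically close concepts get nearby vectors.
ONTOLOGY = {
    "Aerodrome General": [0.9, 0.1, 0.0],
    "Person": [0.0, 0.2, 0.95],
}

def align(inferred_vec, threshold=0.7):
    """Best-matching ontology type, or None below the confidence threshold."""
    best = max(ONTOLOGY, key=lambda n: cosine(inferred_vec, ONTOLOGY[n]))
    score = cosine(inferred_vec, ONTOLOGY[best])
    return (best, score) if score >= threshold else (None, score)

airport_vec = [0.85, 0.15, 0.05]  # stand-in embedding for "Airport"
name, score = align(airport_vec)
```

Types whose best score falls below the threshold are kept as-is, mirroring the "CustomType → (kept as-is)" line in the sample output.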
ontology-first - Ontology as Primary Source¶
What happens: Three-phase process where ontology is authoritative:
- Phase 1 (Load Ontology): Extract ALL entity/relationship types from ontology graph
- Phase 2 (LLM Enhancement): Optionally discover additional patterns not in ontology
- Phase 3 (Merge): Combine ontology base with LLM discoveries
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode ontology-first \
--ontology-graph my_ontology
Output:
🎯 Ontology-First Mode: Ontology as primary source
📚 Phase 1: Loading Ontology Schema (Complete)
✓ Loaded 45 entity types from ontology
✓ Loaded 23 relationship types from ontology
🔍 Phase 2: LLM Enhancement (Discovering additional patterns)
+ Discovered entity: CustomDataType
+ Discovered relationship: HAS_METADATA
🔗 Phase 3: Merging Schema
✓ Final entity types: 46 (45 from ontology + 1 discovered)
✓ Final relationship types: 24 (23 from ontology + 1 discovered)
Key advantage: All ontology concepts guaranteed in schema. LLM can only add, not replace.
Why this exists: In graph-hybrid, important ontology concepts like HAS_ALIAS were lost because the LLM didn't discover them from samples. ontology-first ensures nothing is lost.
Pros:
- All ontology concepts preserved
- LLM can still discover new patterns
- Expert domain knowledge takes priority

Cons:
- May include unused ontology concepts
- Requires well-defined ontology
Use when: You have a complete ontology and can't afford to lose any concepts.
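The Phase 3 merge behaves like a dictionary union in which ontology entries take precedence, so the LLM can add types but never replace them. A minimal sketch (the type names and source labels are illustrative):

```python
def merge_schema(ontology_types: dict, discovered_types: dict) -> dict:
    """Ontology entries win on conflicts; the LLM can only add, not replace."""
    merged = dict(discovered_types)
    merged.update(ontology_types)  # ontology overwrites any clash
    return merged

base = {"Person": "ontology", "Organization": "ontology"}
found = {"Person": "llm-variant", "CustomDataType": "llm"}
schema = merge_schema(base, found)
```

This matches the sample output above: 45 ontology types plus 1 discovery yields 46, with no ontology concept lost.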
Decision Flowchart¶
Do you have an ontology?
│
├─► No
│ │
│ └─► Do you need consistent types?
│ │
│ ├─► No → Use `none` (quick prototyping)
│ │
│ └─► Yes → Use `llm` (automatic discovery)
│
└─► Yes
│
└─► Is the ontology complete/authoritative?
│
├─► Yes, use it exactly → Use `ontology-first`
│
└─► No, LLM should discover
│
└─► Need semantic alignment?
│
├─► Yes → Use `graph-hybrid` (recommended)
│
└─► No → Use `hybrid` (string matching)
Comparison Matrix¶
| Aspect | none | llm | ontology | hybrid | graph-hybrid | ontology-first |
|---|---|---|---|---|---|---|
| Setup time | None | Low | Medium | Medium | High | High |
| Ontology required | No | No | Yes | Yes | Yes (loaded to graph) | Yes (loaded to graph) |
| Type consistency | Poor | Good | Excellent | Good | Excellent | Excellent |
| Flexibility | High | High | Low | Medium | High | Medium |
| Handles unknown data | Yes | Yes | No | Yes | Yes | Partially |
| Semantic alignment | No | No | No | No | Yes | No |
| Preserves all ontology concepts | N/A | N/A | Yes | No | No | Yes |
| FTM data support | Poor | Fair | Good | Good | Excellent | Excellent |
Real-World Example: Terrorist Organizations Dataset¶
We built knowledge graphs from OpenSanctions FTM data using different modes:
Without Schema (none)¶
Problems: - "DESIGNATED_BY", "SANCTIONED_BY", "LISTED_BY" all mean the same thing - Retrieval precision suffered due to type fragmentation
With graph-hybrid¶
Benefits:
- All sanctions relationships consolidated to SANCTION
- Alias relationships consistently use HAS_ALIAS
- FTM entity types preserved (Person, Organization, Sanction)
Configuration Options¶
Alignment Confidence Threshold¶
For graph-hybrid mode, control how strict semantic alignment must be:
aletheia build-knowledge-graph \
--schema-mode graph-hybrid \
--alignment-confidence 0.8 # Stricter (default: 0.7)
Higher values produce fewer alignments, but the matches that remain are higher quality.
Alignment Report¶
Save alignment details for inspection:
aletheia build-knowledge-graph \
--schema-mode graph-hybrid \
--alignment-report output/alignment.json
{
"mode": "graph-hybrid",
"confidence_threshold": 0.7,
"entity_alignments": [
{
"inferred_name": "Airport",
"ontology_name": "Aerodrome General",
"confidence": 0.89,
"alternatives": ["Runway", "Heliport"]
}
],
"failed_entity_alignments": [
["CustomType", "No suitable alignment found"]
]
}
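Because the report is plain JSON with the structure shown above, it is straightforward to audit programmatically. A sketch that flags borderline alignments for manual review (the 0.2 review margin is an arbitrary choice for illustration):

```python
import json

report_text = """
{"mode": "graph-hybrid",
 "confidence_threshold": 0.7,
 "entity_alignments": [
   {"inferred_name": "Airport", "ontology_name": "Aerodrome General",
    "confidence": 0.89, "alternatives": ["Runway", "Heliport"]}],
 "failed_entity_alignments": [["CustomType", "No suitable alignment found"]]}
"""

report = json.loads(report_text)

# Flag alignments within 0.2 of the threshold for manual review.
margin = report["confidence_threshold"] + 0.2
borderline = [a["inferred_name"] for a in report["entity_alignments"]
              if a["confidence"] < margin]

# Failed alignments are (name, reason) pairs.
failed = [name for name, _reason in report["failed_entity_alignments"]]
```

Reviewing `failed` after each build is a quick way to spot ontology gaps or genuinely novel concepts.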
Summary Recommendations¶
| Scenario | Recommended Mode |
|---|---|
| Quick exploration | none |
| Unknown data, no ontology | llm |
| FTM/OpenSanctions data | graph-hybrid |
| Complete formal ontology | ontology-first |
| Need strict schema control | ontology |
| Ontology + flexibility | hybrid or graph-hybrid |
Phase 4: Consolidation (All Modes)¶
Regardless of which mode you choose, Aletheia applies a Phase 4 consolidation step after mode-specific processing. An LLM reviews the complete schema for redundancies, merges semantically similar types, and normalizes naming inconsistencies. Ontology-derived and extension types are protected from removal.
This step reduces the type fragmentation problem described above — even in llm mode, Phase 4 catches cases where the LLM generated both HAS_OPERATOR and OPERATED_BY for the same concept.
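The consolidation step behaves like canonicalization over a synonym map, with protected types exempt from removal. In Aletheia an LLM proposes the merges; the hard-coded map below is purely illustrative:

```python
# Illustrative synonym map; in practice an LLM proposes these merges.
CANONICAL = {
    "OPERATED_BY": "HAS_OPERATOR",
    "IS_LOCATED_IN": "LOCATED_IN",
    "BASED_IN": "LOCATED_IN",
}

# Ontology-derived and extension types are never removed or renamed.
PROTECTED = {"SANCTIONS", "HAS_ALIAS"}

def consolidate(rel_types: set[str]) -> set[str]:
    """Collapse synonym relationship names to a canonical form."""
    return {name if name in PROTECTED else CANONICAL.get(name, name)
            for name in rel_types}

schema = consolidate({"HAS_OPERATOR", "OPERATED_BY", "BASED_IN", "SANCTIONS"})
```

Here HAS_OPERATOR and OPERATED_BY collapse to one type, mirroring the example in the paragraph above.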
Default recommendation: Use graph-hybrid for production knowledge graphs with FTM data. It provides the best balance of flexibility (LLM discovers what's in your data) and consistency (semantic alignment with ontology).