Schema Inference: Choosing the Right Strategy

This FAQ explains why schema inference matters, what options are available, and how to choose the right strategy for your use case.

Why Schema Inference?

The Problem

When building a knowledge graph, Graphiti's LLM extracts entities and relationships from text. Without guidance, extraction is unconstrained:

Input: "Hamas is a terrorist organization designated by the US State Department"

Without schema:
  - Entity: "Hamas" (type: ???)
  - Entity: "US State Department" (type: ???)
  - Relationship: ??? → ???

The LLM must decide:

- What entity types to create (Person? Organization? TerroristGroup? GovernmentAgency?)
- What relationship types to use (DESIGNATED_BY? SANCTIONS? IS_A?)
- What properties each entity should have

The Consequences

Without schema guidance, you get:

| Problem | Example |
|---|---|
| Inconsistent types | Same concept labeled "Person", "Individual", "Human" |
| Redundant relationships | "LOCATED_IN", "IS_LOCATED_IN", "BASED_IN" for the same meaning |
| Missing structure | Important relationships not extracted |
| Semantic drift | Types evolve unpredictably across documents |

In one real evaluation, unconstrained extraction produced 579 unique relationship types with massive overlap.

The Solution: Schema Inference

Schema inference provides the LLM with a vocabulary of entity types, relationship types, and properties to use during extraction:

# With schema guidance
entity_types = {"Organization": Organization, "Sanction": Sanction}
edge_types = {"SANCTIONS": Sanctions, "HAS_ALIAS": HasAlias}

# Extraction becomes consistent
await graphiti.add_episode(
    ...,
    entity_types=entity_types,
    edge_types=edge_types,
)

Available Schema Modes

Aletheia provides 6 distinct schema modes (plus an inference alias for llm), each with different tradeoffs:

| Mode | Primary Source | LLM Role | Ontology Required | Best For |
|---|---|---|---|---|
| none | Graphiti defaults | Full discretion | No | Quick prototyping |
| llm | LLM inference | Primary | No | Unknown data structure |
| inference | (Alias for llm) | Primary | No | Unknown data structure |
| ontology | Ontology file | None | Yes | Strict formal domains |
| hybrid | LLM + ontology validation | Primary + validation | Yes | Balanced approach |
| graph-hybrid | LLM + semantic alignment | Primary + alignment | Yes | FTM data (recommended) |
| ontology-first | Ontology | Enhancement only | Yes | Well-defined domains |

Mode Details

none - No Schema

What happens: Graphiti uses its default generic schema with Entity and RELATED_TO.

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode none

Pros:
- Zero setup time
- Works immediately

Cons:
- No type consistency
- Poor retrieval precision
- Relationship types unpredictable

Use when: Quick prototyping or exploring data, when graph quality doesn't matter.


llm / inference - LLM-Inferred Schema

What happens: Two-stage LLM analysis:

  1. Stage 1 (Domain Analysis): LLM analyzes sample data and generates a domain-specific extraction prompt
  2. Stage 2 (Schema Extraction): Uses the generated prompt to extract a structured schema

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

Output: Generated schema saved to schemas/<graph_name>/schema_v1.py

📊 Stage 1: Domain Analysis
   ✓ Domain: Sanctions and terrorist organization data...
   ✓ Entities identified: 8
   ✓ Relationships identified: 5
   ✓ Extraction prompt saved: prompts/dynamic/my_graph/

📋 Stage 2: Schema Extraction
   ✓ Entity types extracted: 8
   ✓ Relationship types extracted: 5

Pros:
- Discovers schema from data automatically
- No ontology required
- Good for unknown data structures

Cons:
- Limited by sample data coverage
- May miss important concepts not in samples
- Schema quality depends on the LLM

Use when: You don't have an ontology and want automatic schema discovery.


ontology - Strict Ontology Adherence

What happens: Schema extracted directly from ontology file. LLM has no input.

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode ontology

Pros:
- Complete control over schema
- No LLM variability
- Matches formal domain models exactly

Cons:
- Requires a well-defined ontology
- No flexibility for unexpected data
- May reject valid entities not in the ontology

Use when: You have a formal ontology (OWL/TTL) and need strict adherence.


hybrid - LLM + Ontology Validation

What happens:

  1. LLM infers schema from sample data
  2. Ontology validates and corrects entity names (fuzzy matching)
  3. Ontology enriches entities with additional properties

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode hybrid

Pros:
- LLM discovers what's in the data
- Ontology provides consistency
- Best of both worlds

Cons:
- String-based matching (not semantic)
- May miss semantic equivalents ("Airport" vs "Aerodrome")

Use when: You have an ontology but want LLM flexibility.
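The string-based validation step can be approximated with stdlib fuzzy matching. A minimal sketch, not Aletheia's actual implementation, assuming the ontology's type names are available as a plain list:

```python
from difflib import get_close_matches

# Hypothetical ontology vocabulary; in hybrid mode this comes from the ontology file.
ONTOLOGY_TYPES = ["Person", "Organization", "Sanction", "Aerodrome General"]

def validate_type_name(inferred: str, cutoff: float = 0.8) -> str:
    """Snap an LLM-inferred type name to the closest ontology name, or keep it unchanged."""
    matches = get_close_matches(inferred, ONTOLOGY_TYPES, n=1, cutoff=cutoff)
    return matches[0] if matches else inferred

print(validate_type_name("Organisation"))  # spelling variant -> "Organization"
print(validate_type_name("Airport"))       # no lexical overlap -> stays "Airport"
```

Because the match is lexical, "Airport" can never reach "Aerodrome General"; that is exactly the gap graph-hybrid's semantic alignment is designed to close.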


graph-hybrid - LLM + Semantic Alignment

What happens: Three-phase process:

  1. Phase 1 (LLM-First): Unbiased LLM inference without ontology guidance
  2. Phase 2 (Semantic Alignment): Uses Graphiti search to align inferred types with ontology via embeddings
  3. Phase 3 (Property Enrichment): Adds ontology properties to aligned entities

# Step 1: Load ontology into graph (once)
aletheia build-ontology-graph \
  --use-case my_case \
  --knowledge-graph my_ontology

# Step 2: Build with graph-hybrid
aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode graph-hybrid \
  --ontology-graph my_ontology

Output:

🔮 Graph-Hybrid Mode: LLM-first + Semantic Alignment

📊 Phase 1: LLM-First Inference (Unbiased)
   ✓ Inferred 12 entities, 8 relationships

🔍 Phase 2: Semantic Alignment via Knowledge Graph
   ✓ Airport → Aerodrome General (89%)
   ✓ Person → Person (95%)
   ⚠️ CustomType → (kept as-is)
   ✓ Aligned 10 concepts (confidence threshold: 70%)

📚 Phase 3: Property Enrichment (Data-Driven)
   ✓ Enriched 8 entities:
      - Person: +12 properties
      - Organization: +8 properties

Key advantage: Semantic alignment bridges terminology gaps:
- "Airport" matches "Aerodrome General" via embedding similarity
- Multilingual support (Spanish "Persona" matches "Person")
- Handles synonyms and domain variations
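The Phase 2 mechanics can be sketched as a best-match search over embedding cosine similarity with a confidence cutoff. The vectors below are toy stand-ins invented for this example; in practice they come from an embedding model, which is what places "Airport" near "Aerodrome General" despite zero string overlap:

```python
import math

# Illustrative vectors only; a real system would call an embedding model.
EMBEDDINGS = {
    "Airport":           [0.90, 0.10, 0.00],
    "Aerodrome General": [0.85, 0.15, 0.05],
    "Person":            [0.05, 0.90, 0.10],
    "CustomType":        [0.10, 0.10, 0.90],
}
ONTOLOGY_TYPES = ["Aerodrome General", "Person"]

def cosine(a, b):
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def align(inferred: str, threshold: float = 0.7):
    """Best ontology match by cosine similarity; below threshold, keep the inferred name."""
    best = max(ONTOLOGY_TYPES, key=lambda o: cosine(EMBEDDINGS[inferred], EMBEDDINGS[o]))
    confidence = cosine(EMBEDDINGS[inferred], EMBEDDINGS[best])
    return (best, confidence) if confidence >= threshold else (inferred, None)

print(align("Airport"))     # aligned to "Aerodrome General"
print(align("CustomType"))  # below threshold -> kept as-is
```

This reproduces the behavior in the log above: "Airport" aligns with high confidence, while a truly novel "CustomType" falls below the threshold and is kept as-is.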

Pros:
- LLM discovers naturally
- Semantic alignment handles terminology gaps
- Ontology enriches with properties
- Alignment report for transparency

Cons:
- Requires the ontology to be loaded into the graph
- Alignment may fail for truly novel concepts
- More complex setup

Use when: FTM data, well-defined ontology, need flexibility + consistency.


ontology-first - Ontology as Primary Source

What happens: Three-phase process where ontology is authoritative:

  1. Phase 1 (Load Ontology): Extract ALL entity/relationship types from ontology graph
  2. Phase 2 (LLM Enhancement): Optionally discover additional patterns not in ontology
  3. Phase 3 (Merge): Combine ontology base with LLM discoveries

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode ontology-first \
  --ontology-graph my_ontology

Output:

🎯 Ontology-First Mode: Ontology as primary source

📚 Phase 1: Loading Ontology Schema (Complete)
   ✓ Loaded 45 entity types from ontology
   ✓ Loaded 23 relationship types from ontology

🔍 Phase 2: LLM Enhancement (Discovering additional patterns)
   + Discovered entity: CustomDataType
   + Discovered relationship: HAS_METADATA

🔗 Phase 3: Merging Schema
   ✓ Final entity types: 46 (45 from ontology + 1 discovered)
   ✓ Final relationship types: 24 (23 from ontology + 1 discovered)

Key advantage: All ontology concepts guaranteed in schema. LLM can only add, not replace.
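The "add, not replace" merge rule can be sketched as a dictionary union where the ontology side wins on collisions. A minimal illustration, not the actual implementation; the origin tags are invented for the example:

```python
def merge_schema(ontology_types: dict, llm_types: dict) -> dict:
    """Phase 3 merge: every ontology type survives; LLM discoveries are
    added only under names the ontology does not already define."""
    merged = dict(ontology_types)
    for name, definition in llm_types.items():
        merged.setdefault(name, definition)
    return merged

# Hypothetical type definitions, tagged by origin for illustration.
ontology_types = {"Person": "ontology", "Organization": "ontology"}
llm_types = {"Person": "llm", "CustomDataType": "llm"}

merged = merge_schema(ontology_types, llm_types)
print(merged)  # "Person" keeps the ontology definition; "CustomDataType" is added
```

With 45 ontology entity types and one LLM discovery, this yields exactly the 46 = 45 + 1 shown in the output above.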

Why this exists: In graph-hybrid, important ontology concepts like HAS_ALIAS were lost because the LLM didn't discover them from samples. ontology-first ensures nothing is lost.

Pros:
- All ontology concepts preserved
- LLM can still discover new patterns
- Expert domain knowledge takes priority

Cons:
- May include unused ontology concepts
- Requires a well-defined ontology

Use when: You have a complete ontology and can't afford to lose any concepts.


Decision Flowchart

Do you have an ontology?
├─► No
│   │
│   └─► Do you need consistent types?
│       │
│       ├─► No → Use `none` (quick prototyping)
│       │
│       └─► Yes → Use `llm` (automatic discovery)
└─► Yes
    └─► Is the ontology complete/authoritative?
        ├─► Yes, use it exactly → Use `ontology-first`
        └─► No, LLM should discover
            └─► Need semantic alignment?
                ├─► Yes → Use `graph-hybrid` (recommended)
                └─► No → Use `hybrid` (string matching)

Comparison Matrix

| Aspect | none | llm | ontology | hybrid | graph-hybrid | ontology-first |
|---|---|---|---|---|---|---|
| Setup time | None | Low | Medium | Medium | High | High |
| Ontology required | No | No | Yes | Yes | Yes (loaded to graph) | Yes (loaded to graph) |
| Type consistency | Poor | Good | Excellent | Good | Excellent | Excellent |
| Flexibility | High | High | Low | Medium | High | Medium |
| Handles unknown data | Yes | Yes | No | Yes | Yes | Partially |
| Semantic alignment | No | No | No | No | Yes | No |
| Preserves all ontology concepts | N/A | N/A | Yes | No | No | Yes |
| FTM data support | Poor | Fair | Good | Good | Excellent | Excellent |

Real-World Example: Terrorist Organizations Dataset

We built knowledge graphs from OpenSanctions FTM data using different modes:

Without Schema (none)

Nodes: 2,341
Relationships: 8,923
Unique relationship types: 312  ← Too many!

Problems:
- "DESIGNATED_BY", "SANCTIONED_BY", "LISTED_BY" all mean the same thing
- Retrieval precision suffered due to type fragmentation

With graph-hybrid

Nodes: 2,341
Relationships: 8,923
Unique relationship types: 18  ← Consistent!

Benefits:
- All sanctions relationships consolidated to SANCTION
- Alias relationships consistently use HAS_ALIAS
- FTM entity types preserved (Person, Organization, Sanction)


Configuration Options

Alignment Confidence Threshold

For graph-hybrid mode, control how strict semantic alignment must be:

aletheia build-knowledge-graph \
  --schema-mode graph-hybrid \
  --alignment-confidence 0.8  # Stricter (default: 0.7)

Higher values = fewer alignments but higher quality matches.

Alignment Report

Save alignment details for inspection:

aletheia build-knowledge-graph \
  --schema-mode graph-hybrid \
  --alignment-report output/alignment.json

Example report:

{
  "mode": "graph-hybrid",
  "confidence_threshold": 0.7,
  "entity_alignments": [
    {
      "inferred_name": "Airport",
      "ontology_name": "Aerodrome General",
      "confidence": 0.89,
      "alternatives": ["Runway", "Heliport"]
    }
  ],
  "failed_entity_alignments": [
    ["CustomType", "No suitable alignment found"]
  ]
}
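A report with this shape is plain JSON, so borderline and failed alignments can be audited programmatically. A small sketch; in practice you would `json.load` the file written by the command above, but here the structure is inlined so the snippet is self-contained:

```python
# Mirrors the report structure shown above.
report = {
    "confidence_threshold": 0.7,
    "entity_alignments": [
        {"inferred_name": "Airport", "ontology_name": "Aerodrome General", "confidence": 0.89},
        {"inferred_name": "Person", "ontology_name": "Person", "confidence": 0.95},
    ],
    "failed_entity_alignments": [["CustomType", "No suitable alignment found"]],
}

# Flag matches worth a manual look, and everything that fell below the threshold.
borderline = [a["inferred_name"] for a in report["entity_alignments"] if a["confidence"] < 0.90]
failed = [name for name, _reason in report["failed_entity_alignments"]]
print(borderline)  # ['Airport']
print(failed)      # ['CustomType']
```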

Summary Recommendations

| Scenario | Recommended Mode |
|---|---|
| Quick exploration | none |
| Unknown data, no ontology | llm |
| FTM/OpenSanctions data | graph-hybrid |
| Complete formal ontology | ontology-first |
| Need strict schema control | ontology |
| Ontology + flexibility | hybrid or graph-hybrid |

Phase 4: Consolidation (All Modes)

Regardless of which mode you choose, Aletheia applies a Phase 4 consolidation step after mode-specific processing. An LLM reviews the complete schema for redundancies, merges semantically similar types, and normalizes naming inconsistencies. Ontology-derived and extension types are protected from removal.

This step reduces the type fragmentation problem described above — even in llm mode, Phase 4 catches cases where the LLM generated both HAS_OPERATOR and OPERATED_BY for the same concept.
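The consolidation pass can be pictured as applying a canonical-name map produced by the review LLM, with ontology-derived types exempt. The merge decisions below are invented for the example:

```python
# Hypothetical merge decisions returned by the consolidation LLM's review.
CANONICAL = {
    "HAS_OPERATOR": "OPERATED_BY",
    "IS_LOCATED_IN": "LOCATED_IN",
    "BASED_IN": "LOCATED_IN",
}
PROTECTED = {"SANCTIONS", "HAS_ALIAS"}  # ontology-derived types are never merged away

def consolidate(edge_types: set[str]) -> set[str]:
    """Apply the canonical-name map, leaving protected types untouched."""
    return {t if t in PROTECTED else CANONICAL.get(t, t) for t in edge_types}

print(consolidate({"HAS_OPERATOR", "OPERATED_BY", "BASED_IN", "HAS_ALIAS"}))
# HAS_OPERATOR folds into OPERATED_BY; BASED_IN folds into LOCATED_IN; HAS_ALIAS is protected
```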

Default recommendation: Use graph-hybrid for production knowledge graphs with FTM data. It provides the best balance of flexibility (LLM discovers what's in your data) and consistency (semantic alignment with ontology).