Schema Inference: Choosing the Right Strategy

This FAQ explains why schema inference matters, what options are available, and how to choose the right strategy for your use case.

Why Schema Inference?

The Problem

When building a knowledge graph, Graphiti's LLM extracts entities and relationships from text. Without guidance, extraction is unconstrained:

Input: "Hamas is a terrorist organization designated by the US State Department"

Without schema:
  - Entity: "Hamas" (type: ???)
  - Entity: "US State Department" (type: ???)
  - Relationship: ??? → ???

The LLM must decide:

- What entity types to create (Person? Organization? TerroristGroup? GovernmentAgency?)
- What relationship types to use (DESIGNATED_BY? SANCTIONS? IS_A?)
- What properties each entity should have

The Consequences

Without schema guidance, you get:

| Problem | Example |
|---|---|
| Inconsistent types | Same concept labeled "Person", "Individual", "Human" |
| Redundant relationships | "LOCATED_IN", "IS_LOCATED_IN", "BASED_IN" for the same meaning |
| Missing structure | Important relationships not extracted |
| Semantic drift | Types evolve unpredictably across documents |

In one real evaluation, unconstrained extraction produced 579 unique relationship types with massive overlap.

The Solution: Schema Inference

Schema inference provides the LLM with a vocabulary of entity types, relationship types, and properties to use during extraction:

# With schema guidance
entity_types = {"Organization": Organization, "Sanction": Sanction}
edge_types = {"SANCTIONS": Sanctions, "HAS_ALIAS": HasAlias}

# Extraction becomes consistent
await graphiti.add_episode(
    ...,
    entity_types=entity_types,
    edge_types=edge_types,
)

Available Schema Modes

Aletheia provides 6 distinct schema modes (plus an inference alias for llm), each with different tradeoffs:

| Mode | Primary Source | LLM Role | Ontology Required | Best For |
|---|---|---|---|---|
| none | Graphiti defaults | Full discretion | No | Quick prototyping |
| llm | LLM inference | Primary | No | Unknown data structure |
| inference | (Alias for llm) | Primary | No | Unknown data structure |
| ontology | Ontology file | None | Yes | Strict formal domains |
| hybrid | LLM + ontology validation | Primary + validation | Yes | Balanced approach |
| graph-hybrid | LLM + semantic alignment | Primary + alignment | Yes | FTM data (recommended) |
| ontology-first | Ontology | Enhancement only | Yes | Well-defined domains |

Mode Details

none - No Schema

What happens: Graphiti uses its default generic schema with Entity and RELATED_TO.

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode none

Pros:
- Zero setup time
- Works immediately

Cons:
- No type consistency
- Poor retrieval precision
- Relationship types unpredictable

Use when: Quick prototyping or exploring data, when graph quality doesn't matter.


llm / inference - LLM-Inferred Schema

What happens: Two-stage LLM analysis:

  1. Stage 1 (Domain Analysis): LLM analyzes sample data and generates a domain-specific extraction prompt
  2. Stage 2 (Schema Extraction): Uses the generated prompt to extract a structured schema

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

Output: Generated schema saved to schemas/<graph_name>/schema_v1.py

📊 Stage 1: Domain Analysis
   ✓ Domain: Sanctions and terrorist organization data...
   ✓ Entities identified: 8
   ✓ Relationships identified: 5
   ✓ Extraction prompt saved: prompts/dynamic/my_graph/

📋 Stage 2: Schema Extraction
   ✓ Entity types extracted: 8
   ✓ Relationship types extracted: 5

Pros:
- Discovers schema from data automatically
- No ontology required
- Good for unknown data structures

Cons:
- Limited by sample data coverage
- May miss important concepts not in samples
- Schema quality depends on the LLM

Use when: You don't have an ontology and want automatic schema discovery.


ontology - Strict Ontology Adherence

What happens: Schema extracted directly from ontology file. LLM has no input.

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode ontology

Pros:
- Complete control over schema
- No LLM variability
- Matches formal domain models exactly

Cons:
- Requires a well-defined ontology
- No flexibility for unexpected data
- May reject valid entities not in the ontology

Use when: You have a formal ontology (OWL/TTL) and need strict adherence.


hybrid - LLM + Ontology Validation

What happens:

  1. LLM infers schema from sample data
  2. Ontology validates and corrects entity names (fuzzy matching)
  3. Ontology enriches entities with additional properties

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode hybrid

Pros:
- LLM discovers what's in the data
- Ontology provides consistency
- Best of both worlds

Cons:
- String-based matching (not semantic)
- May miss semantic equivalents ("Airport" vs "Aerodrome")

Use when: You have an ontology but want LLM flexibility.
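The string-based validation step can be approximated with stdlib fuzzy matching. A minimal sketch, not Aletheia's actual implementation, assuming the ontology's type names are available as a plain list:

```python
from difflib import get_close_matches

# Hypothetical ontology vocabulary; in hybrid mode this comes from the ontology file.
ONTOLOGY_TYPES = ["Person", "Organization", "Sanction", "Aerodrome General"]

def validate_type_name(inferred: str, cutoff: float = 0.8) -> str:
    """Snap an LLM-inferred type name to the closest ontology name, or keep it unchanged."""
    matches = get_close_matches(inferred, ONTOLOGY_TYPES, n=1, cutoff=cutoff)
    return matches[0] if matches else inferred

print(validate_type_name("Organisation"))  # spelling variant -> "Organization"
print(validate_type_name("Airport"))       # no lexical overlap -> stays "Airport"
```

Because the match is lexical, "Airport" can never reach "Aerodrome General"; that is exactly the gap graph-hybrid's semantic alignment is designed to close.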


graph-hybrid - LLM + Semantic Alignment

What happens: Three-phase process:

  1. Phase 1 (LLM-First): Unbiased LLM inference without ontology guidance
  2. Phase 2 (Semantic Alignment): Uses Graphiti search to align inferred types with ontology via embeddings
  3. Phase 3 (Property Enrichment): Adds ontology properties to aligned entities

# Step 1: Load ontology into graph (once)
aletheia build-ontology-graph \
  --use-case my_case \
  --knowledge-graph my_ontology

# Step 2: Build with graph-hybrid
aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode graph-hybrid \
  --ontology-graph my_ontology

Output:

🔮 Graph-Hybrid Mode: LLM-first + Semantic Alignment

📊 Phase 1: LLM-First Inference (Unbiased)
   ✓ Inferred 12 entities, 8 relationships

🔍 Phase 2: Semantic Alignment via Knowledge Graph
   ✓ Airport → Aerodrome General (89%)
   ✓ Person → Person (95%)
   ⚠️ CustomType → (kept as-is)
   ✓ Aligned 10 concepts (confidence threshold: 70%)

📚 Phase 3: Property Enrichment (Data-Driven)
   ✓ Enriched 8 entities:
      - Person: +12 properties
      - Organization: +8 properties

Key advantage: Semantic alignment bridges terminology gaps:
- "Airport" matches "Aerodrome General" via embedding similarity
- Multilingual support (Spanish "Persona" matches "Person")
- Handles synonyms and domain variations
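The Phase 2 mechanics can be sketched as a best-match search over embedding cosine similarity with a confidence cutoff. The vectors below are toy stand-ins invented for this example; in practice they come from an embedding model, which is what places "Airport" near "Aerodrome General" despite zero string overlap:

```python
import math

# Illustrative vectors only; a real system would call an embedding model.
EMBEDDINGS = {
    "Airport":           [0.90, 0.10, 0.00],
    "Aerodrome General": [0.85, 0.15, 0.05],
    "Person":            [0.05, 0.90, 0.10],
    "CustomType":        [0.10, 0.10, 0.90],
}
ONTOLOGY_TYPES = ["Aerodrome General", "Person"]

def cosine(a, b):
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def align(inferred: str, threshold: float = 0.7):
    """Best ontology match by cosine similarity; below threshold, keep the inferred name."""
    best = max(ONTOLOGY_TYPES, key=lambda o: cosine(EMBEDDINGS[inferred], EMBEDDINGS[o]))
    confidence = cosine(EMBEDDINGS[inferred], EMBEDDINGS[best])
    return (best, confidence) if confidence >= threshold else (inferred, None)

print(align("Airport"))     # aligned to "Aerodrome General"
print(align("CustomType"))  # below threshold -> kept as-is
```

This reproduces the behavior in the log above: "Airport" aligns with high confidence, while a truly novel "CustomType" falls below the threshold and is kept as-is.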

Pros:
- LLM discovers naturally
- Semantic alignment handles terminology gaps
- Ontology enriches with properties
- Alignment report for transparency

Cons:
- Requires the ontology to be loaded into the graph
- Alignment may fail for truly novel concepts
- More complex setup

Use when: FTM data, well-defined ontology, need flexibility + consistency.


ontology-first - Ontology as Primary Source

What happens: Three-phase process where ontology is authoritative:

  1. Phase 1 (Load Ontology): Extract ALL entity/relationship types from ontology graph
  2. Phase 2 (LLM Enhancement): Optionally discover additional patterns not in ontology
  3. Phase 3 (Merge): Combine ontology base with LLM discoveries

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode ontology-first \
  --ontology-graph my_ontology

Output:

🎯 Ontology-First Mode: Ontology as primary source

📚 Phase 1: Loading Ontology Schema (Complete)
   ✓ Loaded 45 entity types from ontology
   ✓ Loaded 23 relationship types from ontology

🔍 Phase 2: LLM Enhancement (Discovering additional patterns)
   + Discovered entity: CustomDataType
   + Discovered relationship: HAS_METADATA

🔗 Phase 3: Merging Schema
   ✓ Final entity types: 46 (45 from ontology + 1 discovered)
   ✓ Final relationship types: 24 (23 from ontology + 1 discovered)

Key advantage: All ontology concepts guaranteed in schema. LLM can only add, not replace.
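The "add, not replace" merge rule can be sketched as a dictionary union where the ontology side wins on collisions. A minimal illustration, not the actual implementation; the origin tags are invented for the example:

```python
def merge_schema(ontology_types: dict, llm_types: dict) -> dict:
    """Phase 3 merge: every ontology type survives; LLM discoveries are
    added only under names the ontology does not already define."""
    merged = dict(ontology_types)
    for name, definition in llm_types.items():
        merged.setdefault(name, definition)
    return merged

# Hypothetical type definitions, tagged by origin for illustration.
ontology_types = {"Person": "ontology", "Organization": "ontology"}
llm_types = {"Person": "llm", "CustomDataType": "llm"}

merged = merge_schema(ontology_types, llm_types)
print(merged)  # "Person" keeps the ontology definition; "CustomDataType" is added
```

With 45 ontology entity types and one LLM discovery, this yields exactly the 46 = 45 + 1 shown in the output above.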

Why this exists: In graph-hybrid, important ontology concepts like HAS_ALIAS were lost because the LLM didn't discover them from samples. ontology-first ensures nothing is lost.

Pros:
- All ontology concepts preserved
- LLM can still discover new patterns
- Expert domain knowledge takes priority

Cons:
- May include unused ontology concepts
- Requires a well-defined ontology

Use when: You have a complete ontology and can't afford to lose any concepts.


Decision Flowchart

Do you have an ontology?
├─► No
│   │
│   └─► Do you need consistent types?
│       │
│       ├─► No → Use `none` (quick prototyping)
│       │
│       └─► Yes → Use `llm` (automatic discovery)
└─► Yes
    └─► Is the ontology complete/authoritative?
        ├─► Yes, use it exactly → Use `ontology-first`
        └─► No, LLM should discover
            └─► Need semantic alignment?
                ├─► Yes → Use `graph-hybrid` (recommended)
                └─► No → Use `hybrid` (string matching)

Comparison Matrix

| Aspect | none | llm | ontology | hybrid | graph-hybrid | ontology-first |
|---|---|---|---|---|---|---|
| Setup time | None | Low | Medium | Medium | High | High |
| Ontology required | No | No | Yes | Yes | Yes (loaded to graph) | Yes (loaded to graph) |
| Type consistency | Poor | Good | Excellent | Good | Excellent | Excellent |
| Flexibility | High | High | Low | Medium | High | Medium |
| Handles unknown data | Yes | Yes | No | Yes | Yes | Partially |
| Semantic alignment | No | No | No | No | Yes | No |
| Preserves all ontology concepts | N/A | N/A | Yes | No | No | Yes |
| FTM data support | Poor | Fair | Good | Good | Excellent | Excellent |

Real-World Example: Terrorist Organizations Dataset

We built knowledge graphs from OpenSanctions FTM data using different modes:

Without Schema (none)

Nodes: 2,341
Relationships: 8,923
Unique relationship types: 312  ← Too many!

Problems:
- "DESIGNATED_BY", "SANCTIONED_BY", "LISTED_BY" all mean the same thing
- Retrieval precision suffered due to type fragmentation

With graph-hybrid

Nodes: 2,341
Relationships: 8,923
Unique relationship types: 18  ← Consistent!

Benefits:
- All sanctions relationships consolidated to SANCTION
- Alias relationships consistently use HAS_ALIAS
- FTM entity types preserved (Person, Organization, Sanction)


Configuration Options

Alignment Confidence Threshold

For graph-hybrid mode, control how strict semantic alignment must be:

aletheia build-knowledge-graph \
  --schema-mode graph-hybrid \
  --alignment-confidence 0.8  # Stricter (default: 0.7)

Higher values = fewer alignments but higher quality matches.

Alignment Report

Save alignment details for inspection:

aletheia build-knowledge-graph \
  --schema-mode graph-hybrid \
  --alignment-report output/alignment.json

Example report:

{
  "mode": "graph-hybrid",
  "confidence_threshold": 0.7,
  "entity_alignments": [
    {
      "inferred_name": "Airport",
      "ontology_name": "Aerodrome General",
      "confidence": 0.89,
      "alternatives": ["Runway", "Heliport"]
    }
  ],
  "failed_entity_alignments": [
    ["CustomType", "No suitable alignment found"]
  ]
}
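A report with this shape is plain JSON, so borderline and failed alignments can be audited programmatically. A small sketch; in practice you would `json.load` the file written by the command above, but here the structure is inlined so the snippet is self-contained:

```python
# Mirrors the report structure shown above.
report = {
    "confidence_threshold": 0.7,
    "entity_alignments": [
        {"inferred_name": "Airport", "ontology_name": "Aerodrome General", "confidence": 0.89},
        {"inferred_name": "Person", "ontology_name": "Person", "confidence": 0.95},
    ],
    "failed_entity_alignments": [["CustomType", "No suitable alignment found"]],
}

# Flag matches worth a manual look, and everything that fell below the threshold.
borderline = [a["inferred_name"] for a in report["entity_alignments"] if a["confidence"] < 0.90]
failed = [name for name, _reason in report["failed_entity_alignments"]]
print(borderline)  # ['Airport']
print(failed)      # ['CustomType']
```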

Summary Recommendations

| Scenario | Recommended Mode |
|---|---|
| Quick exploration | none |
| Unknown data, no ontology | llm |
| FTM/OpenSanctions data | graph-hybrid |
| Complete formal ontology | ontology-first |
| Need strict schema control | ontology |
| Ontology + flexibility | hybrid or graph-hybrid |

Phase 4: Consolidation (All Modes)

Regardless of which mode you choose, Aletheia applies a Phase 4 consolidation step after mode-specific processing. An LLM reviews the complete schema for redundancies, merges semantically similar types, and normalizes naming inconsistencies. Ontology-derived and extension types are protected from removal.

This step reduces the type fragmentation problem described above — even in llm mode, Phase 4 catches cases where the LLM generated both HAS_OPERATOR and OPERATED_BY for the same concept.
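The consolidation pass can be pictured as applying a canonical-name map produced by the review LLM, with ontology-derived types exempt. The merge decisions below are invented for the example:

```python
# Hypothetical merge decisions returned by the consolidation LLM's review.
CANONICAL = {
    "HAS_OPERATOR": "OPERATED_BY",
    "IS_LOCATED_IN": "LOCATED_IN",
    "BASED_IN": "LOCATED_IN",
}
PROTECTED = {"SANCTIONS", "HAS_ALIAS"}  # ontology-derived types are never merged away

def consolidate(edge_types: set[str]) -> set[str]:
    """Apply the canonical-name map, leaving protected types untouched."""
    return {t if t in PROTECTED else CANONICAL.get(t, t) for t in edge_types}

print(consolidate({"HAS_OPERATOR", "OPERATED_BY", "BASED_IN", "HAS_ALIAS"}))
# HAS_OPERATOR folds into OPERATED_BY; BASED_IN folds into LOCATED_IN; HAS_ALIAS is protected
```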

Default recommendation: Use graph-hybrid for production knowledge graphs with FTM data. It provides the best balance of flexibility (LLM discovers what's in your data) and consistency (semantic alignment with ontology).