Schema Inference¶

This section documents how Aletheia infers and manages schemas for knowledge graph construction.


What is Schema Inference?¶

When building a knowledge graph, the LLM extracts entities and relationships from text. Schema inference determines what vocabulary of types the LLM should use during extraction.

Without schema guidance:

Input: "Hamas is a terrorist organization designated by the US State Department"

LLM decides freely:
  - Entity types: Person? Organization? TerroristGroup? GovernmentAgency?
  - Relationship types: DESIGNATED_BY? SANCTIONS? IS_A? ASSOCIATED_WITH?
  - Properties: ???

With schema guidance:

Input: "Hamas is a terrorist organization designated by the US State Department"

Using defined schema:
  - Entity: "Hamas" (type: Organization)
  - Entity: "US State Department" (type: Organization)
  - Relationship: SANCTION (US State Department → Hamas)

Why Schema Matters¶

Without Schema: Chaos¶

In one real evaluation, unconstrained extraction produced 579 unique relationship types with massive semantic overlap:

Variants	Should Be
LOCATED_IN, IS_LOCATED_IN, BASED_IN, SITUATED_IN	LOCATED_IN
DESIGNATED_BY, SANCTIONED_BY, LISTED_BY	SANCTION
WORKS_FOR, EMPLOYED_BY, WORKS_AT	EMPLOYED_BY

This fragmentation destroys retrieval precision: a query written against one relationship name misses every result stored under its variants.
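The variant table above is a many-to-one mapping from extracted names to canonical names. A minimal sketch of applying such a mapping after extraction follows. The `CANONICAL` dictionary and `normalize_relationship` function are hypothetical names built from the rows shown, not part of Aletheia.

```python
# Hypothetical mapping copied from the variant table; not Aletheia's API.
CANONICAL = {
    "LOCATED_IN": "LOCATED_IN",
    "IS_LOCATED_IN": "LOCATED_IN",
    "BASED_IN": "LOCATED_IN",
    "SITUATED_IN": "LOCATED_IN",
    "DESIGNATED_BY": "SANCTION",
    "SANCTIONED_BY": "SANCTION",
    "LISTED_BY": "SANCTION",
    "WORKS_FOR": "EMPLOYED_BY",
    "EMPLOYED_BY": "EMPLOYED_BY",
    "WORKS_AT": "EMPLOYED_BY",
}


def normalize_relationship(name: str) -> str:
    """Return the canonical relationship type, or the cleaned name if unmapped."""
    key = name.strip().upper().replace(" ", "_").replace("-", "_")
    return CANONICAL.get(key, key)


extracted = ["is_located_in", "BASED_IN", "Listed By", "WORKS_AT", "OWNS"]
normalized = sorted({normalize_relationship(n) for n in extracted})
print(normalized)  # ['EMPLOYED_BY', 'LOCATED_IN', 'OWNS', 'SANCTION']
```

A static lookup like this only covers variants that are already known. A schema defined before extraction avoids producing the variants in the first place.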

With Schema: Consistency¶

A well-defined schema ensures:

Type consistency: The same concept always uses the same name
Relationship clarity: Clear, queryable relationship vocabulary
Property standardization: Consistent attribute names across entities
Better retrieval: Queries find all relevant results

Available Schema Modes¶

Aletheia provides six schema modes (plus `inference`, an alias for `llm`) that trade automation against control:

Mode	Description	Ontology Required	Recommended For
`none`	No schema, Graphiti defaults	No	Quick prototyping
`llm`	Two-stage LLM inference	No	Unknown data
`ontology`	Strict ontology adherence	Yes	Formal domains
`hybrid`	LLM + ontology validation	Yes	Balanced approach
`graph-hybrid`	LLM + semantic alignment	Yes	FTM data
`ontology-first`	Ontology primary, LLM enhancement	Yes	Complete ontologies
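The "Ontology Required" column can be checked before any inference runs. The sketch below is illustrative only: `REQUIRES_ONTOLOGY`, `ALIASES`, and `resolve_mode` are hypothetical names, and treating `inference` as the alias for `llm` is an assumption based on the `llm/inference` entry in the pipeline diagram.

```python
# Mode names and ontology requirements copied from the table above.
REQUIRES_ONTOLOGY = {
    "none": False,
    "llm": False,
    "ontology": True,
    "hybrid": True,
    "graph-hybrid": True,
    "ontology-first": True,
}

# Assumption: "inference" is the alias for "llm".
ALIASES = {"inference": "llm"}


def resolve_mode(mode: str, has_ontology: bool) -> str:
    """Resolve aliases and reject modes whose ontology requirement is unmet."""
    canonical = ALIASES.get(mode, mode)
    if canonical not in REQUIRES_ONTOLOGY:
        raise ValueError(f"unknown schema mode: {mode!r}")
    if REQUIRES_ONTOLOGY[canonical] and not has_ontology:
        raise ValueError(f"mode {canonical!r} requires an ontology")
    return canonical


print(resolve_mode("inference", has_ontology=False))  # llm
```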

Decision Guide¶

graph TD
    A[Do you have an ontology?] -->|No| B[Need consistent types?]
    A -->|Yes| C[Is ontology complete/authoritative?]

    B -->|No| D[none]
    B -->|Yes| E[llm]

    C -->|Yes, use it exactly| F[ontology-first]
    C -->|No, LLM should discover| G[Need semantic alignment?]

    G -->|Yes| H[graph-hybrid]
    G -->|No| I[hybrid]
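The flowchart above can also be read as a small function. This is a sketch of the same decision logic, not code from Aletheia; the function and parameter names are hypothetical.

```python
def choose_mode(
    has_ontology: bool,
    needs_consistent_types: bool = False,
    ontology_is_authoritative: bool = False,
    needs_semantic_alignment: bool = False,
) -> str:
    """Walk the decision guide and return the suggested schema mode."""
    if not has_ontology:
        # Branch B: without an ontology, only type consistency matters.
        return "llm" if needs_consistent_types else "none"
    if ontology_is_authoritative:
        # Branch C, "Yes, use it exactly".
        return "ontology-first"
    # Branch G: the LLM should discover types beyond the ontology.
    return "graph-hybrid" if needs_semantic_alignment else "hybrid"


print(choose_mode(has_ontology=True, needs_semantic_alignment=True))  # graph-hybrid
```

Like the flowchart, this function never returns `ontology`; that mode appears only in the quick recommendations, for cases that need strict schema control.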

Quick recommendations:

Scenario	Mode
Exploring new data quickly	`none`
Unknown data, no ontology	`llm`
FTM/OpenSanctions data	`graph-hybrid`
Aviation/domain with formal ontology	`ontology-first`
Need strict schema control	`ontology`

Core Concepts¶

Entity Types¶

Entity types define the kinds of nodes in your knowledge graph:

from pydantic import BaseModel

class Organization(BaseModel):
    """A corporation, government body, or other organization."""
    jurisdiction: str | None = None
    incorporation_date: str | None = None
    status: str | None = None

Entity types are:

PascalCase names (Person, Organization, Aircraft)
Pydantic models with typed properties
Passed to Graphiti's entity_types parameter

Relationship Types¶

Relationship types define the kinds of edges:

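from pydantic import BaseModel

class Sanction(BaseModel):
    """A sanction designation between entities."""

# HasAlias and Owns are illustrative stubs so the dict below resolves.
class HasAlias(BaseModel):
    """An alias linking an entity to an alternative name."""

class Owns(BaseModel):
    """An ownership link between entities."""

# Usage in EDGE_TYPES dict
EDGE_TYPES = {
    "SANCTION": Sanction,
    "HAS_ALIAS": HasAlias,
    "OWNS": Owns,
}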

Relationship types are:

UPPER_SNAKE_CASE names (SANCTION, HAS_ALIAS, OWNS)
Pydantic models (usually empty, properties optional)
Passed to Graphiti's edge_types parameter
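The two naming conventions can be checked mechanically. The following is a minimal sketch using only the standard library; the helper names are hypothetical. Deriving an edge key from a class name (`HasAlias` to `HAS_ALIAS`) is an assumption that matches the examples above, not a documented Aletheia rule.

```python
import re

# Strict patterns: acronym-style names such as "USAgency" would be rejected.
PASCAL_CASE = re.compile(r"^[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)*$")
UPPER_SNAKE_CASE = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$")


def is_entity_name(name: str) -> bool:
    """True if the name follows the PascalCase entity convention."""
    return PASCAL_CASE.fullmatch(name) is not None


def is_edge_name(name: str) -> bool:
    """True if the name follows the UPPER_SNAKE_CASE edge convention."""
    return UPPER_SNAKE_CASE.fullmatch(name) is not None


def edge_name_for(class_name: str) -> str:
    """Convert a PascalCase class name to an UPPER_SNAKE_CASE key."""
    # Insert "_" at each lowercase-or-digit to uppercase boundary.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", class_name).upper()


print(edge_name_for("HasAlias"))  # HAS_ALIAS
print(is_entity_name("Organization"), is_edge_name("Has_Alias"))  # True False
```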

Generated Schema Files¶

Schema inference produces Python files in schemas/<graph_name>/:

schemas/
└── my_graph/
    ├── __init__.py
    ├── schema_v1.py      # Generated Pydantic models
    └── metadata.json     # Provenance information

Example schema_v1.py:

"""Generated schema for knowledge graph."""
from pydantic import BaseModel, Field

# Entity Types
class Person(BaseModel):
    """A natural person."""
    birth_date: str | None = None
    nationality: str | None = None

class Organization(BaseModel):
    """An organization or company."""
    jurisdiction: str | None = None

# Relationship Types
class Sanction(BaseModel):
    """A sanction designation."""
    pass

# Exports
ENTITY_TYPES = {
    "Person": Person,
    "Organization": Organization,
}

EDGE_TYPES = {
    "SANCTION": Sanction,
}

Ontologies¶

For modes that use ontologies (ontology, hybrid, graph-hybrid, ontology-first), you need:

TTL/OWL file defining classes and relationships
Ontology graph loaded into the database

# Load ontology into graph (once)
aletheia build-ontology-graph \
  --use-case my_case \
  --knowledge-graph my_ontology

See Ontology Mode for details on ontology format and loading.

Schema Inference Pipeline¶

┌─────────────────────────────────────────────────────────┐
│                    INPUT                                 │
├─────────────────────────────────────────────────────────┤
│  Sample Data (from parser)                              │
│  + Ontology (if applicable)                             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│          MODE-SPECIFIC PROCESSING                        │
├─────────────────────────────────────────────────────────┤
│  - none: Use Graphiti defaults                          │
│  - llm/inference: Two-stage LLM analysis                │
│  - ontology: Extract from TTL/OWL                       │
│  - hybrid: LLM + ontology string validation             │
│  - graph-hybrid: LLM + semantic alignment via graph     │
│  - ontology-first: Ontology base + LLM enhancement      │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│          PHASE 4: CONSOLIDATION (all modes)              │
├─────────────────────────────────────────────────────────┤
│  LLM reviews schema for redundancies                    │
│  Merges semantically similar types                      │
│  Protects ontology/extension types from removal         │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                    OUTPUT                                │
├─────────────────────────────────────────────────────────┤
│  SchemaDefinition:                                      │
│  - entity_types: List[EntityTypeDefinition]             │
│  - relationship_types: List[RelationshipTypeDefinition] │
│  - Edge type map with ("Entity","Entity") catch-all     │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                 CODE GENERATION                          │
├─────────────────────────────────────────────────────────┤
│  schemas/<graph_name>/schema_v1.py                      │
│  - Pydantic models (with enriched docstrings)           │
│  - ENTITY_TYPES + EDGE_TYPES + EDGE_TYPE_MAP            │
│  - CoerciveBaseModel for scalar/list handling            │
└─────────────────────────────────────────────────────────┘
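The output stage mentions an edge type map with an `("Entity", "Entity")` catch-all. The sketch below shows one way such a map could be consulted. The map contents and the `allowed_edge_types` helper are hypothetical. It also assumes that the catch-all applies in addition to any specific pair, which this page does not state.

```python
# Hypothetical map: (source type, target type) -> allowed edge type names.
EDGE_TYPE_MAP = {
    ("Organization", "Organization"): ["SANCTION", "OWNS"],
    ("Person", "Organization"): ["EMPLOYED_BY"],
    ("Entity", "Entity"): ["HAS_ALIAS"],  # catch-all for any pair
}


def allowed_edge_types(
    source_type: str,
    target_type: str,
    edge_type_map: dict[tuple[str, str], list[str]] = EDGE_TYPE_MAP,
) -> list[str]:
    """Return edge types for the pair, followed by the catch-all types."""
    specific = edge_type_map.get((source_type, target_type), [])
    catch_all = edge_type_map.get(("Entity", "Entity"), [])
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys([*specific, *catch_all]))


print(allowed_edge_types("Organization", "Organization"))
# ['SANCTION', 'OWNS', 'HAS_ALIAS']
print(allowed_edge_types("Person", "Person"))
# ['HAS_ALIAS']
```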

Learn More¶

None Mode - No schema, quick prototyping
LLM Mode - Automatic schema discovery with prompts
Ontology Mode - Strict ontology adherence
Hybrid Mode - LLM + ontology validation
Graph-Hybrid Mode - Semantic alignment (recommended)
Ontology-First Mode - Ontology as primary source