
LLM Mode

The llm mode (also called inference mode) uses a two-stage LLM process to automatically discover and generate a schema from your data.

Overview

  • Ontology Required: No
  • LLM Calls for Schema: 2 (domain analysis + schema extraction)
  • Type Consistency: Good
  • Setup Time: Low
  • Best For: Unknown data, no existing ontology

How It Works

LLM mode uses a two-stage meta-prompt architecture:

┌─────────────────────────────────────────────────────────┐
│                    STAGE 1                               │
│              Domain Analysis                             │
├─────────────────────────────────────────────────────────┤
│  Input: Sample data from parser                         │
│  LLM Role: "Knowledge Graph Architect"                  │
│  Output: Domain-specific extraction prompt              │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                    STAGE 2                               │
│              Schema Extraction                           │
├─────────────────────────────────────────────────────────┤
│  Input: Sample data + extraction prompt from Stage 1    │
│  LLM Role: "Schema Designer"                            │
│  Output: JSON schema definition                         │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                 CODE GENERATION                          │
├─────────────────────────────────────────────────────────┤
│  Output: schemas/<graph>/schema_v1.py                   │
│  - Pydantic entity models                               │
│  - Pydantic relationship models                         │
│  - ENTITY_TYPES and EDGE_TYPES dicts                    │
└─────────────────────────────────────────────────────────┘
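The two stages above can be sketched as a simple pipeline. This is an illustrative sketch, not the actual Aletheia internals: the function signature and the idea of passing a `call_llm` callable are assumptions; only the two-call structure and the Stage 1 `extraction_prompt` field come from this document.

```python
import json

def run_schema_inference(meta_prompt: str, sample_data: str, call_llm) -> dict:
    """Two-stage inference. call_llm maps a prompt string to a JSON string."""
    # Stage 1: domain analysis returns, among other fields, a
    # domain-specific extraction prompt.
    stage1 = json.loads(call_llm(meta_prompt + "\n\n" + sample_data))
    extraction_prompt = stage1["extraction_prompt"]

    # Stage 2: the generated prompt drives schema extraction on the
    # same sample data.
    schema = json.loads(call_llm(extraction_prompt + "\n\n" + sample_data))
    return schema  # {"entity_types": [...], "relationship_types": [...]}
```

Note that Stage 2 sees the sample data twice removed from the meta-prompt: only the prompt Stage 1 generated carries domain knowledge forward.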

Usage

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

Or using the alias:

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode inference

Stage 1: Domain Analysis

The Meta-Prompt

Stage 1 uses a carefully designed meta-prompt that instructs the LLM to act as a "Knowledge Graph Architect":

# Stage 1: Domain Analysis Meta-Prompt

You are a senior knowledge graph architect. Your task is to analyze
sample data from a specific domain and design an optimal schema for
a knowledge graph that will enable powerful queries and insights.

## Your Role

Act as a **Domain Expert and Knowledge Graph Architect** who:
1. Understands the semantic meaning of the data, not just its structure
2. Identifies entities that would be valuable for graph queries
3. Discovers implicit entities hidden within property values
4. Designs relationships that connect entities meaningfully

What the LLM Analyzes

The meta-prompt guides the LLM to discover:

  1. Explicit entities: Directly represented in data
  2. Implicit entities: Hidden in property values (e.g., "Airbus A320" → Manufacturer: "Airbus")
  3. Derived entities: Useful for graph structure (geographic hierarchies, etc.)

Entity Selection Criteria

The LLM evaluates each potential entity:

  • Does it appear in multiple records? (linkability)
  • Would users want to query/filter by it?
  • Does it have meaningful relationships to other entities?
  • Is it a real-world thing with identity?

Stage 1 Output

The LLM returns a JSON object:

{
  "domain_description": "Sanctions and terrorist organization data from...",
  "key_entities_identified": ["Organization", "Sanction", "Person"],
  "key_relationships_identified": ["SANCTION", "HAS_ALIAS", "MEMBER_OF"],
  "entity_count_rationale": "These entities enable queries like...",
  "relationship_count_rationale": "These relationships connect...",
  "extraction_prompt": "Your task is to extract a JSON schema..."
}
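Since the pipeline depends on every one of these fields being present, the Stage 1 response is a natural candidate for structural validation. A hedged sketch, assuming the field names shown above (the `DomainAnalysis` model name is illustrative):

```python
from pydantic import BaseModel

class DomainAnalysis(BaseModel):
    """Shape of the Stage 1 response, mirroring the JSON example above."""
    domain_description: str
    key_entities_identified: list[str]
    key_relationships_identified: list[str]
    entity_count_rationale: str
    relationship_count_rationale: str
    extraction_prompt: str

# A malformed LLM response raises ValidationError here, so a build could
# retry Stage 1 rather than feed a broken prompt into Stage 2.
```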

Extraction Prompt Generation

The extraction_prompt field contains the complete prompt for Stage 2. It's customized for your specific domain and saved to:

prompts/dynamic/<graph_name>/extraction_prompt_v1.md

Stage 2: Schema Extraction

Input

Stage 2 receives:

  1. The extraction prompt generated in Stage 1
  2. Sample data from the parser

Output

The LLM returns a JSON schema definition:

{
  "entity_types": [
    {
      "type_name": "Organization",
      "docstring": "A corporation, government body, or other organization",
      "properties": [
        {
          "name": "alias",
          "type_annotation": "str | None",
          "description": "Alternative name for the organization"
        },
        {
          "name": "jurisdiction",
          "type_annotation": "str | None",
          "description": "Country or region of registration"
        }
      ]
    }
  ],
  "relationship_types": [
    {
      "type_name": "SANCTION",
      "docstring": "A sanction designation between entities",
      "source_entity": "Organization",
      "target_entity": "Organization"
    }
  ]
}
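The code-generation step turns each `entity_types` entry into a Pydantic class. The sketch below shows one plausible rendering of a single entry; the real generator in Aletheia may differ in details such as import handling and formatting:

```python
def render_entity(entity: dict) -> str:
    """Render one entity_types entry (as in the JSON above) to Python source."""
    lines = [
        f"class {entity['type_name']}(BaseModel):",
        f'    """{entity["docstring"]}"""',
    ]
    for prop in entity.get("properties", []):
        # Each property becomes an optional Pydantic field with a description.
        lines.append(
            f"    {prop['name']}: {prop['type_annotation']} = "
            f"Field(None, description=\"{prop['description']}\")"
        )
    return "\n".join(lines)
```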

Console Output

During schema inference, you'll see:

📊 Stage 1: Domain Analysis
   Loading meta-prompt and sample data...
   📎 Detected 3 FK property patterns
   Calling LLM (claude-sonnet-4-20250514)...
   ✓ Domain: Sanctions and terrorist organization data...
   ✓ Entities identified: 3
   ✓ Relationships identified: 4
   ✓ Extraction prompt saved: prompts/dynamic/my_graph/

📋 Stage 2: Schema Extraction
   Using domain-specific extraction prompt...
   Calling LLM (claude-sonnet-4-20250514)...
   ✓ Entity types extracted: 3
   ✓ Relationship types extracted: 4

Generated Files

Schema File

schemas/<graph_name>/schema_v1.py:

"""Generated schema for knowledge graph."""
from pydantic import BaseModel, Field

# Entity Types

class Organization(BaseModel):
    """A corporation, government body, or other organization."""
    alias: str | None = Field(None, description="Alternative name")
    jurisdiction: str | None = None
    status: str | None = None

class Sanction(BaseModel):
    """A sanction designation."""
    authority: str | None = Field(None, description="Sanctioning authority")
    start_date: str | None = None
    end_date: str | None = None

# Relationship Types

class SanctionRel(BaseModel):
    """A sanction relationship."""
    pass

# Exports
ENTITY_TYPES = {
    "Organization": Organization,
    "Sanction": Sanction,
}

EDGE_TYPES = {
    "SANCTION": SanctionRel,
}

Extraction Prompt

prompts/dynamic/<graph_name>/extraction_prompt_v1.md:

The generated prompt is saved for:

  • Inspection and debugging
  • Reuse (skip Stage 1 on subsequent runs)
  • Version control

Metadata

schemas/<graph_name>/metadata.json:

{
  "use_case": "my_case",
  "knowledge_graph": "my_graph",
  "version": "v1",
  "generated_at": "2025-01-09T12:00:00Z",
  "schema_mode": "llm",
  "entity_types": ["Organization", "Sanction"],
  "relationship_types": ["SANCTION", "HAS_ALIAS"]
}

Skipping Stage 1

If you've already run Stage 1 and have an extraction prompt, you can skip it:

# Subsequent runs reuse the extraction prompt
aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

You'll see:

⏭️  Skipping Stage 1: Reusing existing extraction prompt
   ✓ Loaded prompt from: prompts/dynamic/my_graph/

Foreign Key Detection

The schema inference engine automatically detects foreign key patterns in your data:

  • *Entity suffix (e.g., addressEntity, companyEntity) → detected as FK to Address, Company
  • Known FK names (e.g., holder, owner, entity) → detected as FK relationships
  • ID-like values (e.g., NK-xxx, Q12345) → detected as entity references

If FK relationships are missing from Stage 1 output, the engine automatically retries with feedback.

Phase 4: Consolidation

After Stage 2, the schema passes through Phase 4 consolidation — a common step across all modes. An LLM reviews the complete schema to merge redundant types and normalize naming. Data-driven pruning also removes entity types not present in the parser's schema_distribution.
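The data-driven pruning step can be sketched as a dictionary filter. The structure of `schema_distribution` (type name mapped to a record count) is an assumption; only the behavior — dropping entity types the parser never saw — comes from this document:

```python
def prune_entity_types(entity_types: dict, schema_distribution: dict) -> dict:
    """Keep only entity types that actually occur in the parsed data."""
    return {
        name: model
        for name, model in entity_types.items()
        if schema_distribution.get(name, 0) > 0
    }
```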

Pros and Cons

Advantages

  • No ontology required: Works with any data
  • Automatic discovery: LLM identifies relevant entities
  • Domain adaptation: Prompt is customized for your data
  • Prompt reuse: Generated prompt can be reused/modified
  • Phase 4 cleanup: Consolidation reduces type fragmentation

Disadvantages

  • Sample-dependent: Schema limited to what's in samples
  • May miss concepts: Important types not in samples won't be discovered
  • LLM variability: Different runs may produce different schemas (Phase 4 mitigates this)
  • No semantic alignment: No mapping to standard ontologies

When to Use

Use llm mode when:

  1. No ontology exists: You don't have a formal domain model
  2. Unknown data: You're exploring data you haven't seen before
  3. Quick setup: You want automatic schema without manual work
  4. Prototyping: Building initial version before refining

When NOT to Use

Avoid llm mode when:

  1. Formal ontology exists: Use ontology-first or graph-hybrid
  2. Need semantic alignment: Use graph-hybrid for ontology mapping
  3. Critical concepts known: Types you already know matter may be absent from samples and go undiscovered
  4. FTM data: Use graph-hybrid for FollowTheMoney data

Customizing the Process

Modifying the Meta-Prompt

The meta-prompt is stored at:

prompts/static/meta_prompt_v1.md

You can modify it to:

  • Add domain-specific guidance
  • Change entity selection criteria
  • Adjust relationship naming conventions

Modifying the Extraction Prompt

After Stage 1, you can edit the generated extraction prompt:

prompts/dynamic/<graph_name>/extraction_prompt_v1.md

Then re-run Stage 2 to regenerate the schema.