
LLM Mode

The llm mode (also called inference mode) uses a two-stage LLM process to automatically discover and generate a schema from your data.

Overview

  • Ontology Required: No
  • LLM Calls for Schema: 2 (domain analysis + schema extraction)
  • Type Consistency: Good
  • Setup Time: Low
  • Best For: Unknown data, no existing ontology

How It Works

LLM mode uses a two-stage meta-prompt architecture:

┌─────────────────────────────────────────────────────────┐
│                    STAGE 1                               │
│              Domain Analysis                             │
├─────────────────────────────────────────────────────────┤
│  Input: Sample data from parser                         │
│  LLM Role: "Knowledge Graph Architect"                  │
│  Output: Domain-specific extraction prompt              │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                    STAGE 2                               │
│              Schema Extraction                           │
├─────────────────────────────────────────────────────────┤
│  Input: Sample data + extraction prompt from Stage 1    │
│  LLM Role: "Schema Designer"                            │
│  Output: JSON schema definition                         │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                 CODE GENERATION                          │
├─────────────────────────────────────────────────────────┤
│  Output: schemas/<graph>/schema_v1.py                   │
│  - Pydantic entity models                               │
│  - Pydantic relationship models                         │
│  - ENTITY_TYPES and EDGE_TYPES dicts                    │
└─────────────────────────────────────────────────────────┘
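The two stages above can be sketched as a simple pipeline. This is an illustrative sketch, not the actual Aletheia internals: the function signature and the idea of passing a `call_llm` callable are assumptions; only the two-call structure and the Stage 1 `extraction_prompt` field come from this document.

```python
import json

def run_schema_inference(meta_prompt: str, sample_data: str, call_llm) -> dict:
    """Two-stage inference. call_llm maps a prompt string to a JSON string."""
    # Stage 1: domain analysis returns, among other fields, a
    # domain-specific extraction prompt.
    stage1 = json.loads(call_llm(meta_prompt + "\n\n" + sample_data))
    extraction_prompt = stage1["extraction_prompt"]

    # Stage 2: the generated prompt drives schema extraction on the
    # same sample data.
    schema = json.loads(call_llm(extraction_prompt + "\n\n" + sample_data))
    return schema  # {"entity_types": [...], "relationship_types": [...]}
```

Note that Stage 2 sees the sample data twice removed from the meta-prompt: only the prompt Stage 1 generated carries domain knowledge forward.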

Usage

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

Or using the alias:

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode inference

Stage 1: Domain Analysis

The Meta-Prompt

Stage 1 uses a carefully designed meta-prompt that instructs the LLM to act as a "Knowledge Graph Architect":

# Stage 1: Domain Analysis Meta-Prompt

You are a senior knowledge graph architect. Your task is to analyze
sample data from a specific domain and design an optimal schema for
a knowledge graph that will enable powerful queries and insights.

## Your Role

Act as a **Domain Expert and Knowledge Graph Architect** who:
1. Understands the semantic meaning of the data, not just its structure
2. Identifies entities that would be valuable for graph queries
3. Discovers implicit entities hidden within property values
4. Designs relationships that connect entities meaningfully

What the LLM Analyzes

The meta-prompt guides the LLM to discover:

  1. Explicit entities: Directly represented in data
  2. Implicit entities: Hidden in property values (e.g., "Airbus A320" → Manufacturer: "Airbus")
  3. Derived entities: Useful for graph structure (geographic hierarchies, etc.)

Entity Selection Criteria

The LLM evaluates each potential entity:

  • Does it appear in multiple records? (linkability)
  • Would users want to query/filter by it?
  • Does it have meaningful relationships to other entities?
  • Is it a real-world thing with identity?

Stage 1 Output

The LLM returns a JSON object:

{
  "domain_description": "Sanctions and terrorist organization data from...",
  "key_entities_identified": ["Organization", "Sanction", "Person"],
  "key_relationships_identified": ["SANCTION", "HAS_ALIAS", "MEMBER_OF"],
  "entity_count_rationale": "These entities enable queries like...",
  "relationship_count_rationale": "These relationships connect...",
  "extraction_prompt": "Your task is to extract a JSON schema..."
}
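Since the pipeline depends on every one of these fields being present, the Stage 1 response is a natural candidate for structural validation. A hedged sketch, assuming the field names shown above (the `DomainAnalysis` model name is illustrative):

```python
from pydantic import BaseModel

class DomainAnalysis(BaseModel):
    """Shape of the Stage 1 response, mirroring the JSON example above."""
    domain_description: str
    key_entities_identified: list[str]
    key_relationships_identified: list[str]
    entity_count_rationale: str
    relationship_count_rationale: str
    extraction_prompt: str

# A malformed LLM response raises ValidationError here, so a build could
# retry Stage 1 rather than feed a broken prompt into Stage 2.
```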

Extraction Prompt Generation

The extraction_prompt field contains the complete prompt for Stage 2. It's customized for your specific domain and saved to:

prompts/dynamic/<graph_name>/extraction_prompt_v1.md

Stage 2: Schema Extraction

Input

Stage 2 receives:

  1. The extraction prompt generated in Stage 1
  2. Sample data from the parser

Output

The LLM returns a JSON schema definition:

{
  "entity_types": [
    {
      "type_name": "Organization",
      "docstring": "A corporation, government body, or other organization",
      "properties": [
        {
          "name": "alias",
          "type_annotation": "str | None",
          "description": "Alternative name for the organization"
        },
        {
          "name": "jurisdiction",
          "type_annotation": "str | None",
          "description": "Country or region of registration"
        }
      ]
    }
  ],
  "relationship_types": [
    {
      "type_name": "SANCTION",
      "docstring": "A sanction designation between entities",
      "source_entity": "Organization",
      "target_entity": "Organization"
    }
  ]
}
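The code-generation step turns each `entity_types` entry into a Pydantic class. The sketch below shows one plausible rendering of a single entry; the real generator in Aletheia may differ in details such as import handling and formatting:

```python
def render_entity(entity: dict) -> str:
    """Render one entity_types entry (as in the JSON above) to Python source."""
    lines = [
        f"class {entity['type_name']}(BaseModel):",
        f'    """{entity["docstring"]}"""',
    ]
    for prop in entity.get("properties", []):
        # Each property becomes an optional Pydantic field with a description.
        lines.append(
            f"    {prop['name']}: {prop['type_annotation']} = "
            f"Field(None, description=\"{prop['description']}\")"
        )
    return "\n".join(lines)
```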

Console Output

During schema inference, you'll see:

📊 Stage 1: Domain Analysis
   Loading meta-prompt and sample data...
   📎 Detected 3 FK property patterns
   Calling LLM (claude-sonnet-4-20250514)...
   ✓ Domain: Sanctions and terrorist organization data...
   ✓ Entities identified: 3
   ✓ Relationships identified: 4
   ✓ Extraction prompt saved: prompts/dynamic/my_graph/

📋 Stage 2: Schema Extraction
   Using domain-specific extraction prompt...
   Calling LLM (claude-sonnet-4-20250514)...
   ✓ Entity types extracted: 3
   ✓ Relationship types extracted: 4

Generated Files

Schema File

schemas/<graph_name>/schema_v1.py:

"""Generated schema for knowledge graph."""
from pydantic import BaseModel, Field

# Entity Types

class Organization(BaseModel):
    """A corporation, government body, or other organization."""
    alias: str | None = Field(None, description="Alternative name")
    jurisdiction: str | None = None
    status: str | None = None

class Sanction(BaseModel):
    """A sanction designation."""
    authority: str | None = Field(None, description="Sanctioning authority")
    start_date: str | None = None
    end_date: str | None = None

# Relationship Types

class SanctionRel(BaseModel):
    """A sanction relationship."""
    pass

# Exports
ENTITY_TYPES = {
    "Organization": Organization,
    "Sanction": Sanction,
}

EDGE_TYPES = {
    "SANCTION": SanctionRel,
}

Extraction Prompt

prompts/dynamic/<graph_name>/extraction_prompt_v1.md:

The generated prompt is saved for:

  • Inspection and debugging
  • Reuse (skip Stage 1 on subsequent runs)
  • Version control

Metadata

schemas/<graph_name>/metadata.json:

{
  "use_case": "my_case",
  "knowledge_graph": "my_graph",
  "version": "v1",
  "generated_at": "2025-01-09T12:00:00Z",
  "schema_mode": "llm",
  "entity_types": ["Organization", "Sanction"],
  "relationship_types": ["SANCTION", "HAS_ALIAS"]
}

Skipping Stage 1

If you've already run Stage 1 and have an extraction prompt, you can skip it:

# Subsequent runs reuse the extraction prompt
aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --schema-mode llm

You'll see:

⏭️  Skipping Stage 1: Reusing existing extraction prompt
   ✓ Loaded prompt from: prompts/dynamic/my_graph/

Foreign Key Detection

The schema inference engine automatically detects foreign key patterns in your data:

  • *Entity suffix (e.g., addressEntity, companyEntity) → detected as FK to Address, Company
  • Known FK names (e.g., holder, owner, entity) → detected as FK relationships
  • ID-like values (e.g., NK-xxx, Q12345) → detected as entity references

If FK relationships are missing from Stage 1 output, the engine automatically retries with feedback.

Phase 4: Consolidation

After Stage 2, the schema passes through Phase 4 consolidation — a common step across all modes. An LLM reviews the complete schema to merge redundant types and normalize naming. Data-driven pruning also removes entity types not present in the parser's schema_distribution.
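The data-driven pruning step can be sketched as a dictionary filter. The structure of `schema_distribution` (type name mapped to a record count) is an assumption; only the behavior — dropping entity types the parser never saw — comes from this document:

```python
def prune_entity_types(entity_types: dict, schema_distribution: dict) -> dict:
    """Keep only entity types that actually occur in the parsed data."""
    return {
        name: model
        for name, model in entity_types.items()
        if schema_distribution.get(name, 0) > 0
    }
```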

Pros and Cons

Advantages

  • No ontology required: Works with any data
  • Automatic discovery: LLM identifies relevant entities
  • Domain adaptation: Prompt is customized for your data
  • Prompt reuse: Generated prompt can be reused/modified
  • Phase 4 cleanup: Consolidation reduces type fragmentation

Disadvantages

  • Sample-dependent: Schema limited to what's in samples
  • May miss concepts: Important types not in samples won't be discovered
  • LLM variability: Different runs may produce different schemas (Phase 4 mitigates this)
  • No semantic alignment: No mapping to standard ontologies

When to Use

Use llm mode when:

  1. No ontology exists: You don't have a formal domain model
  2. Unknown data: You're exploring data you haven't seen before
  3. Quick setup: You want automatic schema without manual work
  4. Prototyping: Building initial version before refining

When NOT to Use

Avoid llm mode when:

  1. Formal ontology exists: Use ontology-first or graph-hybrid
  2. Need semantic alignment: Use graph-hybrid for ontology mapping
  3. Critical concepts known: Types you already know matter may be absent from samples and go undiscovered
  4. FTM data: Use graph-hybrid for FollowTheMoney data

Customizing the Process

Modifying the Meta-Prompt

The meta-prompt is stored at:

prompts/static/meta_prompt_v1.md

You can modify it to:

  • Add domain-specific guidance
  • Change entity selection criteria
  • Adjust relationship naming conventions

Modifying the Extraction Prompt

After Stage 1, you can edit the generated extraction prompt:

prompts/dynamic/<graph_name>/extraction_prompt_v1.md

Then re-run Stage 2 to regenerate the schema.