LLM Mode¶
The llm mode (also called inference mode) uses a two-stage LLM process to automatically discover and generate a schema from your data.
Overview¶
| Aspect | Value |
|---|---|
| Ontology Required | No |
| LLM Calls for Schema | 2 (domain analysis + schema extraction) |
| Type Consistency | Good |
| Setup Time | Low |
| Best For | Unknown data, no existing ontology |
How It Works¶
LLM mode uses a two-stage meta-prompt architecture:
┌─────────────────────────────────────────────────────────┐
│ STAGE 1 │
│ Domain Analysis │
├─────────────────────────────────────────────────────────┤
│ Input: Sample data from parser │
│ LLM Role: "Knowledge Graph Architect" │
│ Output: Domain-specific extraction prompt │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STAGE 2 │
│ Schema Extraction │
├─────────────────────────────────────────────────────────┤
│ Input: Sample data + extraction prompt from Stage 1 │
│ LLM Role: "Schema Designer" │
│ Output: JSON schema definition │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ CODE GENERATION │
├─────────────────────────────────────────────────────────┤
│ Output: schemas/<graph>/schema_v1.py │
│ - Pydantic entity models │
│ - Pydantic relationship models │
│ - ENTITY_TYPES and EDGE_TYPES dicts │
└─────────────────────────────────────────────────────────┘
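The three boxes above can be sketched as a single driver function. This is an illustrative sketch, not aletheia's actual code; `run_schema_inference` and `call_llm` are hypothetical names, with `call_llm` standing in for whatever client performs the model call:

```python
import json

def run_schema_inference(sample_data: str, call_llm) -> dict:
    """Two-stage meta-prompt pipeline (sketch; all names are illustrative)."""
    # Stage 1: the LLM acts as a "Knowledge Graph Architect" and returns
    # JSON whose "extraction_prompt" field drives Stage 2.
    stage1 = json.loads(call_llm(
        "You are a senior knowledge graph architect. Analyze this sample "
        "data and design a domain-specific extraction prompt.\n\n" + sample_data
    ))
    # Stage 2: apply the generated prompt to the same samples to obtain
    # the JSON schema definition that code generation turns into Pydantic.
    return json.loads(call_llm(stage1["extraction_prompt"] + "\n\n" + sample_data))
```

The same `sample_data` feeds both stages; only the instructions around it change.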
Usage¶
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode llm
Or using the alias:
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode inference
Stage 1: Domain Analysis¶
The Meta-Prompt¶
Stage 1 uses a carefully designed meta-prompt that instructs the LLM to act as a "Knowledge Graph Architect":
# Stage 1: Domain Analysis Meta-Prompt
You are a senior knowledge graph architect. Your task is to analyze
sample data from a specific domain and design an optimal schema for
a knowledge graph that will enable powerful queries and insights.
## Your Role
Act as a **Domain Expert and Knowledge Graph Architect** who:
1. Understands the semantic meaning of the data, not just its structure
2. Identifies entities that would be valuable for graph queries
3. Discovers implicit entities hidden within property values
4. Designs relationships that connect entities meaningfully
What the LLM Analyzes¶
The meta-prompt guides the LLM to discover:
- Explicit entities: Directly represented in data
- Implicit entities: Hidden in property values (e.g., "Airbus A320" → Manufacturer: "Airbus")
- Derived entities: Useful for graph structure (geographic hierarchies, etc.)
Entity Selection Criteria¶
The LLM evaluates each potential entity:
- Does it appear in multiple records? (linkability)
- Would users want to query/filter by it?
- Does it have meaningful relationships to other entities?
- Is it a real-world thing with identity?
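These criteria amount to a four-point checklist. A toy scoring function makes that concrete (the field names are hypothetical; the LLM applies this judgment in prose, not code):

```python
def entity_score(candidate: dict) -> int:
    """Count how many of the four selection criteria a candidate meets.

    `candidate` is a hypothetical summary dict, e.g.
    {"record_count": 12, "queryable": True, "relationship_count": 2,
     "has_identity": True}.
    """
    return sum([
        candidate["record_count"] > 1,        # appears in multiple records (linkability)
        bool(candidate["queryable"]),         # users would query/filter by it
        candidate["relationship_count"] > 0,  # meaningful relationships to other entities
        bool(candidate["has_identity"]),      # real-world thing with identity
    ])
```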
Stage 1 Output¶
The LLM returns a JSON object:
{
"domain_description": "Sanctions and terrorist organization data from...",
"key_entities_identified": ["Organization", "Sanction", "Person"],
"key_relationships_identified": ["SANCTION", "HAS_ALIAS", "MEMBER_OF"],
"entity_count_rationale": "These entities enable queries like...",
"relationship_count_rationale": "These relationships connect...",
"extraction_prompt": "Your task is to extract a JSON schema..."
}
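Because the Stage 1 fields are fixed, a caller can validate the raw LLM response before trusting it downstream. A minimal sketch in the same Pydantic style as the generated schemas (the `DomainAnalysis` class name is an assumption, not part of aletheia):

```python
from pydantic import BaseModel

class DomainAnalysis(BaseModel):
    """Shape of the Stage 1 JSON shown above; class name is hypothetical."""
    domain_description: str
    key_entities_identified: list[str]
    key_relationships_identified: list[str]
    entity_count_rationale: str
    relationship_count_rationale: str
    extraction_prompt: str

# Validation raises a ValidationError if the model returned malformed JSON.
raw = {
    "domain_description": "Sanctions and terrorist organization data",
    "key_entities_identified": ["Organization", "Sanction", "Person"],
    "key_relationships_identified": ["SANCTION", "HAS_ALIAS", "MEMBER_OF"],
    "entity_count_rationale": "These entities enable common queries",
    "relationship_count_rationale": "These relationships connect entities",
    "extraction_prompt": "Your task is to extract a JSON schema...",
}
analysis = DomainAnalysis(**raw)
```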
Extraction Prompt Generation¶
The extraction_prompt field contains the complete prompt for Stage 2. It is customized for your specific domain and saved to prompts/dynamic/<graph_name>/extraction_prompt_v1.md (see Generated Files below).
Stage 2: Schema Extraction¶
Input¶
Stage 2 receives:

1. The extraction prompt generated in Stage 1
2. Sample data from the parser
Output¶
The LLM returns a JSON schema definition:
{
"entity_types": [
{
"type_name": "Organization",
"docstring": "A corporation, government body, or other organization",
"properties": [
{
"name": "alias",
"type_annotation": "str | None",
"description": "Alternative name for the organization"
},
{
"name": "jurisdiction",
"type_annotation": "str | None",
"description": "Country or region of registration"
}
]
}
],
"relationship_types": [
{
"type_name": "SANCTION",
"docstring": "A sanction designation between entities",
"source_entity": "Organization",
"target_entity": "Organization"
}
]
}
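The code-generation step then renders each entry of this JSON into Pydantic source. A minimal sketch for entity types (aletheia's actual generator also handles relationships and the ENTITY_TYPES/EDGE_TYPES exports; `render_entity_model` is a hypothetical name):

```python
def render_entity_model(spec: dict) -> str:
    """Render one entity_types entry (shaped like the JSON above) as source."""
    lines = [
        f"class {spec['type_name']}(BaseModel):",
        f'    """{spec["docstring"]}"""',
    ]
    for prop in spec.get("properties", []):
        # Every inferred property is optional, mirroring the generated file.
        lines.append(
            f'    {prop["name"]}: {prop["type_annotation"]} = '
            f'Field(None, description="{prop["description"]}")'
        )
    return "\n".join(lines)
```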
Console Output¶
During schema inference, you'll see:
📊 Stage 1: Domain Analysis
Loading meta-prompt and sample data...
📎 Detected 3 FK property patterns
Calling LLM (claude-sonnet-4-20250514)...
✓ Domain: Sanctions and terrorist organization data...
✓ Entities identified: 3
✓ Relationships identified: 4
✓ Extraction prompt saved: prompts/dynamic/my_graph/
📋 Stage 2: Schema Extraction
Using domain-specific extraction prompt...
Calling LLM (claude-sonnet-4-20250514)...
✓ Entity types extracted: 3
✓ Relationship types extracted: 4
Generated Files¶
Schema File¶
schemas/<graph_name>/schema_v1.py:
"""Generated schema for knowledge graph."""
from pydantic import BaseModel, Field
# Entity Types
class Organization(BaseModel):
"""A corporation, government body, or other organization."""
alias: str | None = Field(None, description="Alternative name")
jurisdiction: str | None = None
status: str | None = None
class Sanction(BaseModel):
"""A sanction designation."""
authority: str | None = Field(None, description="Sanctioning authority")
start_date: str | None = None
end_date: str | None = None
# Relationship Types
class SanctionRel(BaseModel):
"""A sanction relationship."""
pass
# Exports
ENTITY_TYPES = {
"Organization": Organization,
"Sanction": Sanction,
}
EDGE_TYPES = {
"SANCTION": SanctionRel,
}
Extraction Prompt¶
prompts/dynamic/<graph_name>/extraction_prompt_v1.md:
The generated prompt is saved for:

- Inspection and debugging
- Reuse (skip Stage 1 on subsequent runs)
- Version control
Metadata¶
schemas/<graph_name>/metadata.json:
{
"use_case": "my_case",
"knowledge_graph": "my_graph",
"version": "v1",
"generated_at": "2025-01-09T12:00:00Z",
"schema_mode": "llm",
"entity_types": ["Organization", "Sanction"],
"relationship_types": ["SANCTION", "HAS_ALIAS"]
}
Skipping Stage 1¶
If you've already run Stage 1 and have an extraction prompt, you can skip it:
# Subsequent runs reuse the extraction prompt
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--schema-mode llm
You'll see:
⏭️ Skipping Stage 1: Reusing existing extraction prompt
✓ Loaded prompt from: prompts/dynamic/my_graph/
Foreign Key Detection¶
The schema inference engine automatically detects foreign key patterns in your data:
| Pattern | Example | Detected As |
|---|---|---|
| *Entity suffix | addressEntity, companyEntity | FK to Address, Company |
| Known FK names | holder, owner, entity | FK relationships |
| ID-like values | NK-xxx, Q12345 | Entity references |
If FK relationships are missing from Stage 1 output, the engine automatically retries with feedback.
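The table's heuristics can be approximated with a few pattern checks. This is an illustrative sketch; the real detector's rules may be broader:

```python
import re

KNOWN_FK_NAMES = {"holder", "owner", "entity"}   # well-known FK property names
ID_LIKE = re.compile(r"^(NK-\w+|Q\d+)$")         # ID-like values, e.g. NK-xxx, Q12345

def looks_like_fk(prop_name: str, sample_value: str = "") -> bool:
    """Return True if a property looks like a foreign-key reference."""
    if prop_name.endswith("Entity"):             # *Entity suffix pattern
        return True
    if prop_name in KNOWN_FK_NAMES:
        return True
    return bool(ID_LIKE.match(sample_value))
```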
Phase 4: Consolidation¶
After Stage 2, the schema passes through Phase 4 consolidation — a common step across all modes. An LLM reviews the complete schema to merge redundant types and normalize naming. Data-driven pruning also removes entity types not present in the parser's schema_distribution.
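The data-driven pruning step can be pictured as a one-line filter (a sketch; here `schema_distribution` is assumed to map type names to observed record counts):

```python
def prune_entity_types(entity_types: dict, schema_distribution: dict) -> dict:
    """Drop entity types with no support in the parsed data."""
    return {
        name: model
        for name, model in entity_types.items()
        if schema_distribution.get(name, 0) > 0
    }
```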
Pros and Cons¶
Advantages¶
- No ontology required: Works with any data
- Automatic discovery: LLM identifies relevant entities
- Domain adaptation: Prompt is customized for your data
- Prompt reuse: Generated prompt can be reused/modified
- Phase 4 cleanup: Consolidation reduces type fragmentation
Disadvantages¶
- Sample-dependent: Schema limited to what's in samples
- May miss concepts: Important types not in samples won't be discovered
- LLM variability: Different runs may produce different schemas (Phase 4 mitigates this)
- No semantic alignment: No mapping to standard ontologies
When to Use¶
Use llm mode when:
- No ontology exists: You don't have a formal domain model
- Unknown data: You're exploring data you haven't seen before
- Quick setup: You want automatic schema without manual work
- Prototyping: Building initial version before refining
When NOT to Use¶
Avoid llm mode when:
- Formal ontology exists: Use `ontology-first` or `graph-hybrid`
- Need semantic alignment: Use `graph-hybrid` for ontology mapping
- Critical concepts known: Important types might be missing from samples
- FTM data: Use `graph-hybrid` for FollowTheMoney data
Customizing the Process¶
Modifying the Meta-Prompt¶
The meta-prompt is stored at:
You can modify it to:

- Add domain-specific guidance
- Change entity selection criteria
- Adjust relationship naming conventions
Modifying the Extraction Prompt¶
After Stage 1, you can edit the generated extraction prompt:
Then re-run Stage 2 to regenerate the schema.
Related¶
- None Mode - No schema at all
- Graph-Hybrid Mode - LLM + ontology alignment
- Overview - Comparison of all modes