Building Knowledge Graphs¶
This guide covers the complete workflow for building knowledge graphs in Aletheia.
Workflow Overview¶
graph LR
A[Source Data] --> B[Parser]
B --> C[Episode Builder]
C --> D[Graphiti]
D --> E[Knowledge Graph]
Step 1: Choose a Schema Mode¶
Before building, decide on a schema mode:
| Mode | Best For |
|---|---|
| `none` | Quick prototyping, unknown data |
| `llm` / `inference` | Data exploration, no ontology |
| `ontology` | Strict formal domains |
| `hybrid` | LLM + ontology string validation |
| `graph-hybrid` | Production FTM data (recommended) |
| `ontology-first` | Complete ontologies, can't lose concepts |
Step 2: Prepare Ontology (for graph-hybrid)¶
If using graph-hybrid mode, first load the ontology:
aletheia build-ontology-graph \
--use-case terrorist_orgs \
--knowledge-graph terrorist_orgs_ontology
This creates nodes for entity types and relationship types that guide extraction.
Step 3: Build the Graph¶
Basic Build¶
aletheia build-knowledge-graph \
--use-case terrorist_orgs \
--knowledge-graph terrorist_orgs \
--schema-mode graph-hybrid \
--ontology-graph terrorist_orgs_ontology
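For scripted runs, the ontology load and the basic build can be assembled as argument lists. The following is a minimal sketch, not part of Aletheia: the helper names `ontology_command` and `build_command` are hypothetical, and only the subcommands and flags shown in this guide are used.

```python
import shlex


def ontology_command(use_case: str, ontology_graph: str) -> list:
    """Argument list for the Step 2 ontology load (flags as shown in this guide)."""
    return [
        "aletheia", "build-ontology-graph",
        "--use-case", use_case,
        "--knowledge-graph", ontology_graph,
    ]


def build_command(use_case: str, knowledge_graph: str, ontology_graph: str,
                  schema_mode: str = "graph-hybrid") -> list:
    """Argument list for the Step 3 basic build (flags as shown in this guide)."""
    return [
        "aletheia", "build-knowledge-graph",
        "--use-case", use_case,
        "--knowledge-graph", knowledge_graph,
        "--schema-mode", schema_mode,
        "--ontology-graph", ontology_graph,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    steps = (
        ontology_command("terrorist_orgs", "terrorist_orgs_ontology"),
        build_command("terrorist_orgs", "terrorist_orgs", "terrorist_orgs_ontology"),
    )
    for cmd in steps:
        print(shlex.join(cmd))
```

To execute rather than print, each list could be passed to `subprocess.run(cmd, check=True)`, so that a failed ontology load stops the script before the build starts.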
With Community Building¶
Communities cluster related entities for hierarchical queries:
aletheia build-knowledge-graph \
--use-case terrorist_orgs \
--knowledge-graph terrorist_orgs \
--schema-mode graph-hybrid \
--ontology-graph terrorist_orgs_ontology \
--build-communities
Reset and Rebuild¶
To start fresh:
aletheia build-knowledge-graph \
--use-case terrorist_orgs \
--knowledge-graph terrorist_orgs \
--schema-mode graph-hybrid \
--reset
Data Loss
--reset deletes all existing data in the graph.
Resume Interrupted Build¶
If a build is interrupted:
aletheia build-knowledge-graph \
--use-case terrorist_orgs \
--knowledge-graph terrorist_orgs \
--schema-mode graph-hybrid \
--resume
Aletheia tracks progress and resumes from the last successful episode.
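An unattended build can wrap this in a retry loop. The sketch below is illustrative only: it assumes the CLI exits with a non-zero status when a build is interrupted or fails, which this guide does not confirm, and the function names, attempt count, and wait time are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence

BASE_CMD = [
    "aletheia", "build-knowledge-graph",
    "--use-case", "terrorist_orgs",
    "--knowledge-graph", "terrorist_orgs",
    "--schema-mode", "graph-hybrid",
]


def run_cli(cmd: Sequence[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def build_with_resume(run: Callable[[Sequence[str]], int] = run_cli,
                      max_attempts: int = 3,
                      wait_seconds: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Run the build, retrying with --resume after each failed attempt.

    Returns the number of attempts used. Assumes (unverified) that a
    non-zero exit status means the build did not finish.
    """
    cmd = list(BASE_CMD)
    for attempt in range(1, max_attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt == 1:
            cmd.append("--resume")  # later attempts continue from saved progress
        if attempt < max_attempts:
            sleep(wait_seconds)  # give a throttled API time to recover
    raise RuntimeError(f"build still failing after {max_attempts} attempts")
```

The `run` and `sleep` parameters are injectable so the loop can be exercised without invoking the real CLI or actually waiting.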
Monitoring Progress¶
During ingestion, Aletheia displays:
Building knowledge graph...
[=====> ] 25% (250/1000 episodes)
Elapsed: 5m 23s | Remaining: ~16m
Errors: 3
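The remaining-time figure in the sample output is consistent with a simple linear extrapolation from the episodes processed so far. The sketch below reproduces that arithmetic; how Aletheia actually computes its estimate is an assumption here, and both function names are hypothetical.

```python
def estimate_remaining_seconds(done: int, total: int, elapsed_seconds: float) -> float:
    """Linear extrapolation: average time per finished episode times episodes left."""
    if done <= 0:
        raise ValueError("need at least one finished episode to extrapolate")
    return elapsed_seconds * (total - done) / done


def format_duration(seconds: float) -> str:
    """Render seconds in the same 'Xm Ys' style as the progress display."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m {secs}s"


if __name__ == "__main__":
    # 250 of 1000 episodes after 5m 23s (323 seconds), as in the sample output.
    remaining = estimate_remaining_seconds(250, 1000, 5 * 60 + 23)
    print(format_duration(remaining))  # 16m 9s
```

With the sample numbers, 323 seconds for 250 episodes is about 1.29 seconds per episode, and the 750 episodes left come to 969 seconds, which matches the displayed `~16m`.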
Handling Errors¶
Some episodes may fail to ingest. View errors with:
Common error causes:
| Error | Cause | Solution |
|---|---|---|
| Invalid entity IDs | Cross-episode references | Expected; Graphiti reconciles these through entity resolution |
| Token limit exceeded | Episode too large | Split into smaller episodes |
| Rate limit | API throttling | Wait and resume |
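For the "Token limit exceeded" case, one way to split an oversized episode is to pack whole paragraphs into chunks under a size budget. This is a rough sketch under stated assumptions: character count stands in for tokens, the 4000-character default is arbitrary, and `split_episode` is a hypothetical helper rather than an Aletheia function.

```python
def split_episode(text: str, max_chars: int = 4000) -> list:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Characters are only a proxy for tokens; the real limit depends on the model.
    """
    chunks = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single paragraph longer than the budget is cut at fixed offsets.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps related sentences together, which should help extraction more than cutting at arbitrary offsets; the fixed-offset fallback only applies to paragraphs that exceed the budget on their own.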
Verifying the Graph¶
After building, verify the graph:
# Show statistics
aletheia show-graph --knowledge-graph terrorist_orgs
# Query directly (FalkorDB)
redis-cli GRAPH.QUERY terrorist_orgs "MATCH (n) RETURN labels(n), count(*)"
# Query directly (Neo4j)
cypher-shell -d terrorist_orgs "MATCH (n) RETURN labels(n), count(*)"
Best Practices¶
1. Start Small¶
Test with a subset first:
# In your parser, limit records for testing
def parse(self) -> Iterator[Entity]:
    for i, entity in enumerate(self._parse_all()):
        if i >= 100:  # Test with 100 records
            break
        yield entity
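The same cap can be written with `itertools.islice`, which keeps the limit out of the parsing loop and makes it easy to switch off. This is a sketch of an alternative, with `limit_records` as a hypothetical helper name; it is not part of Aletheia's parser interface.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit_records(records: Iterable[T], limit: Optional[int] = None) -> Iterator[T]:
    """Yield at most `limit` records; None passes everything through."""
    return islice(records, limit)


if __name__ == "__main__":
    # Stand-in for a parser's record stream.
    sample = (f"entity-{i}" for i in range(1000))
    subset = list(limit_records(sample, 100))
    print(len(subset))  # 100
```

Because `islice` is lazy, records beyond the limit are never pulled from the underlying iterator, so a test run does not pay for parsing the full source.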
2. Use graph-hybrid for FTM¶
The graph-hybrid mode is optimized for FTM data:
- Ontology provides structure
- LLM handles edge cases
- Semantic alignment improves consistency
3. Build Communities¶
Communities improve retrieval for:
- "What organizations are related to X?"
- "What are the main entity clusters?"
- High-level summarization queries
4. Monitor Token Usage¶
LLM extraction uses tokens. Monitor usage:
- Use `--verbose` to see per-episode token counts
- Consider smaller episodes for large documents
Learn More¶
- Schema Modes - Detailed mode comparison
- CLI Reference - All command options
- Troubleshooting - Common issues