# Aletheia-Graphiti Fork
Aletheia maintains a fork of Graphiti called aletheia-graphiti (david-morales/aletheia-graphiti). This page explains why the fork exists, what it changes, and how it stays in sync with upstream.
## Why a Fork?
Graphiti provides the core knowledge graph engine that Aletheia relies on: LLM-driven entity extraction, deduplication, graph storage, and semantic search. However, during development we encountered three categories of limitations:
- **FalkorDB support gaps** — Custom edge types (typed relationships like `SANCTION`, `OWNERSHIP`) didn't work correctly with FalkorDB's BFS search, BM25 fulltext search, or fulltext index creation. We submitted fixes upstream, but they remained unmerged.
- **Entity extraction and deduplication issues** — The LLM extraction pipeline had prompt engineering problems that caused entity absorption, duplicate nodes, dropped edges, and non-deterministic relationship naming. These required changes to core extraction, deduplication, and edge resolution logic.
- **MCP server surface too limited** — Graphiti's built-in MCP server exposed only 5 basic tools. The underlying library supports 16 search recipes, BFS traversal, temporal filters, community clustering, and multi-graph queries — none of which reached the MCP surface.
Rather than waiting for upstream adoption, we created a maintained fork that consolidates our fixes, extends the MCP server, and can selectively cherry-pick upstream improvements.
## Baseline
The fork was created from upstream Graphiti at commit 45a3d92 (February 2026), corresponding to the v0.27.x series. The current fork version is v0.27.0rc2.
## Repository structure
| Branch | Purpose |
|---|---|
| `main` | Tracks upstream `getzep/graphiti:main` (frozen at fork point) |
| `aletheia` | Working branch: upstream base + fixes + MCP extensions |
Aletheia's pyproject.toml points directly at the fork:
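A sketch of such a pin, assuming a uv-managed project and the upstream package name `graphiti-core` (both assumptions, not a verbatim copy of Aletheia's file):

```toml
[project]
dependencies = [
    "graphiti-core",
]

# Route graphiti-core to the fork's working branch instead of PyPI
[tool.uv.sources]
graphiti-core = { git = "https://github.com/david-morales/aletheia-graphiti", branch = "aletheia" }
```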
## What Changed
The fork contains ~85 commits on top of upstream, organized into four areas.
### 1. Upstream cherry-picks (13 patches)
Bug fixes and improvements from upstream PRs that were merged to Graphiti's main branch after our fork point. These were cherry-picked to keep the fork current with important fixes:
| Cherry-pick | Description |
|---|---|
| #1170 | Fix handle_multiple_group_ids — clone FalkorDB driver for single group_id |
| #1085 | Flatten community update results from semaphore_gather |
| #1086 | Add iteration cap and oscillation detection to label propagation |
| #816 | Guard empty group_ids list in search queries |
| #1163 | Add summary embedding search for entity nodes |
| #764 | Respect config.max_tokens in OpenAI client |
| #1176 | Store queue worker task references to prevent garbage collection |
| #1102 | Improve edge dedup prompt for voice/format equivalence |
| #1131 | Add inline attribute extraction to ExtractedEntity — reverted (caused entity absorption, see below) |
| #1130 | Add orphan node detection and cleanup |
| — | Rate limit exponential backoff for OpenAI API |
| — | Batch community projection queries (eliminate N+1 pattern) |
| — | Deduplicate BFS results at traversal level |
### 2. FalkorDB fixes (8 commits)
Custom edge type support was broken across FalkorDB's search stack. These fixes enable Graphiti's full search capabilities with typed relationships:
| Fix | Impact |
|---|---|
| Enforce custom edge types for FalkorDB | Typed edges stored with correct relationship labels |
| Support custom edge types in BFS search | Graph traversal follows typed relationships |
| Support custom edge types in BM25 fulltext search | Keyword search works across typed edges |
| Auto-create fulltext indexes for custom edge types | New relationship types get searchable automatically |
| Make edge BFS search bidirectional | Traversal works in both directions |
| Two-phase search for edge BFS | Edge BFS can use nodes found by other search methods |
| FalkorDB username authentication | Enables authenticated FalkorDB connections |
### 3. Entity extraction and deduplication fixes (12 commits)
These address prompt engineering pitfalls and algorithmic issues in Graphiti's LLM pipeline:
Entity extraction:
- Reverted `ExtractedEntity.attributes` field — The upstream attribute extraction caused the LLM to absorb secondary entities as key-value attributes instead of extracting them as separate nodes. Evidence: the fork extracted 1 entity/episode vs. upstream's 2-5.
- Disabled content chunking for text episodes — Chunking splits episodes into independent LLM calls, losing cross-section naming context. Entity names became inconsistent across chunks.
- Reverted "strip qualifiers" prompt — An attempted optimization to make the LLM strip contextual descriptions caused it to remove valuable identifiers like ICAO codes from airport names (7 airports with codes dropped to 0).
Node deduplication:
- Exact name match before entropy gate — Short names like "Spain" (5 characters) have low entropy, which deferred them to unreliable LLM dedup. Now exact matches are resolved deterministically before the entropy check.
- Containment matching — Names that contain other names (e.g., "European Union" vs "EU") are caught during dedup resolution.
- Same-batch cross-referencing — Alias duplicates from a single episode are now detected within the same dedup batch.
- Entity type context in dedup — The dedup LLM now sees entity type descriptions for existing nodes too, not just newly extracted ones.
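The resolution order described above can be illustrated with a minimal sketch. The entropy measure, the threshold, and the function names here are illustrative assumptions, not the fork's actual code:

```python
import math
from typing import Optional

def shannon_entropy(name: str) -> float:
    """Character-level Shannon entropy; short names score low (bounded by log2(len))."""
    counts: dict[str, int] = {}
    for ch in name.lower():
        counts[ch] = counts.get(ch, 0) + 1
    total = len(name)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def resolve_node(name: str, existing: dict[str, str],
                 entropy_floor: float = 2.5) -> tuple[Optional[str], str]:
    """Return (uuid_or_None, how_the_name_was_resolved)."""
    # 1. Deterministic exact match first, before any entropy gating,
    #    so short names like "Spain" never reach the unreliable LLM path.
    if name in existing:
        return existing[name], "exact"
    # 2. Containment matching catches alias pairs like "EU" / "European Union".
    for other, uuid in existing.items():
        if name.lower() in other.lower() or other.lower() in name.lower():
            return uuid, "containment"
    # 3. Only ambiguous, high-entropy names are deferred to LLM dedup;
    #    low-entropy leftovers become new nodes (threshold semantics assumed).
    if shannon_entropy(name) < entropy_floor:
        return None, "new-node"
    return None, "llm-dedup"
```

The key design point is ordering: the cheap deterministic checks run unconditionally, and the entropy gate only decides what happens to names that survive them.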
Edge extraction:
- Removed pair pre-assignment — Upstream pre-assigned entity pairs to edge extraction, silently dropping valid edges from overlapping chunks.
- Case-insensitive entity name resolution — Three-level fallback (exact, lowercase, normalized) prevents edge loss from case mismatches.
- Constrained edge types to schema — When schema types are provided, the extraction prompt now says "must be one of these types, do NOT invent new ones." Without this, the LLM non-deterministically invented types (`HAS_OPERATOR` vs `OPERATED_BY` between runs).
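The three-level fallback for entity name resolution can be sketched as follows; this is a simplified illustration, and the fork's actual normalization rules may differ:

```python
import re
import unicodedata
from typing import Optional

def _normalize(name: str) -> str:
    """Lowercase and strip accents plus non-alphanumerics: 'São-Paulo' -> 'saopaulo'."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())

def resolve_entity(name: str, nodes: dict[str, str]) -> Optional[str]:
    """Map an entity name mentioned in an extracted edge to a known node UUID."""
    # Level 1: exact match
    if name in nodes:
        return nodes[name]
    # Level 2: case-insensitive match
    by_lower = {k.lower(): v for k, v in nodes.items()}
    if name.lower() in by_lower:
        return by_lower[name.lower()]
    # Level 3: normalized match (accents and punctuation stripped)
    by_norm = {_normalize(k): v for k, v in nodes.items()}
    return by_norm.get(_normalize(name))
```

Each level is strictly more permissive than the last, so a casing or diacritic mismatch degrades gracefully instead of silently dropping the edge.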
### 4. Extended MCP server (~50 commits)
The largest area of work. The MCP server was extended from 5 tools to 15, organized by capability:
#### Search tools
| Tool | Type | Description |
|---|---|---|
| `search` | New | Unified power search with 16 recipe combinations (nodes/edges/communities/combined), multiple rerankers, type filters, temporal filters, BFS traversal, multi-graph queries |
| `explore_node` | New | Land-and-expand from a named entity — resolves name to UUID, then expands via BFS with proximity ranking |
| `search_ontology` | New | Search the companion ontology graph for class and property definitions |
| `explore_ontology` | New | Explore ontology structure from a specific class |
These replaced three narrower tools: `search_nodes`, `search_memory_facts`, and `get_entity_edge`.
#### Cypher analytics tools
| Tool | Type | Description |
|---|---|---|
| `get_schema` | New | Structural schema discovery with dirty-flag caching |
| `run_cypher` | New | Read-only Cypher execution via `GRAPH.RO_QUERY` |
The Cypher pipeline has a 4-stage sanitization architecture:
- Stage 1 — LLM fixups: Smart quotes, code block stripping, identifier quoting, RETURN injection
- Stage 2 — FalkorDB dialect: Reject unsupported features (APOC, `EXISTS {}`, map projections); auto-fix lossless transformations (`date()` to strings, `toLower()`)
- Stage 3 — Security whitelist: Fail-safe read-only enforcement — only known safe keywords are allowed
- Stage 4 — Safety injection: `LIMIT N+1` for truncation detection
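In outline, the four stages compose into a single sanitizer. The sketch below uses simplified keyword tables and regexes as stand-ins for the fork's real rules:

```python
import re

# Simplified stand-ins for the real keyword tables (assumption, not the fork's lists)
ALL_CLAUSES = {"MATCH", "RETURN", "WITH", "WHERE", "UNWIND", "ORDER", "LIMIT",
               "SKIP", "OPTIONAL", "CREATE", "MERGE", "DELETE", "SET", "REMOVE",
               "DROP", "CALL", "FOREACH", "LOAD"}
READ_ONLY = {"MATCH", "RETURN", "WITH", "WHERE", "UNWIND", "ORDER", "LIMIT",
             "SKIP", "OPTIONAL"}

def sanitize_cypher(query: str, limit: int = 25) -> str:
    # Stage 1 -- LLM fixups: strip code fences and smart quotes, inject RETURN
    query = re.sub(r"^```(?:cypher)?\s*|\s*```$", "", query.strip()).strip()
    query = query.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    if not re.search(r"\bRETURN\b", query, flags=re.I):
        query += " RETURN *"
    # Stage 2 -- FalkorDB dialect: reject features the engine does not support
    if re.search(r"\bapoc\.", query, flags=re.I) or re.search(r"\bEXISTS\s*\{", query, flags=re.I):
        raise ValueError("query uses a feature FalkorDB does not support")
    # Stage 3 -- security whitelist: any recognized clause keyword outside the
    # read-only set is rejected (fail-safe: write clauses cannot slip through)
    for word in re.findall(r"[A-Za-z_]+", query):
        if word.upper() in ALL_CLAUSES and word.upper() not in READ_ONLY:
            raise ValueError(f"non-read-only clause rejected: {word.upper()}")
    # Stage 4 -- safety injection: LIMIT N+1 so the caller can detect truncation
    if not re.search(r"\bLIMIT\b", query, flags=re.I):
        query += f" LIMIT {limit + 1}"
    return query
```

The fail-safe choice in Stage 3 means false positives are possible (a property literally named `set` would be rejected), which is the intended trade-off for a read-only surface.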
#### Graph profiling tool
| Tool | Type | Description |
|---|---|---|
| `profile_graph` | New | Server-side property sampling, relationship validation, and language detection in a single call. Returns coverage percentages, sample values, detected languages, relationship degree statistics, and sample traversal paths. |
This tool runs all profiling queries server-side (one Cypher query per entity type and relationship type), avoiding the 10+ MCP round-trips that would be needed from the client. Results feed into Aletheia's autonomous knowledge discovery (Phase 0), which builds a CapabilityModel with property profiles, relationship validation status, and multilingual field detection — enabling the reasoning engine's planner to generate precise queries without hardcoded domain knowledge.
#### Data and admin tools
| Tool | Type | Description |
|---|---|---|
| `add_memory` | Enhanced | Added bulk episode ingestion support alongside the existing single-episode interface |
| `get_episode_context` | New | Retrieve entities and relationships extracted from specific episodes |
| `build_communities` | New | Trigger label propagation clustering |
| `get_episodes` | Kept | List recent episodes |
| `delete_entity_edge` / `delete_episode` | Kept | Remove relationships or episodes |
| `clear_graph` / `get_status` | Kept | Graph management and health checks |
#### Infrastructure improvements
- Domain-aware tool descriptions — `DomainProfile` auto-discovers entity types, relationship types, counts, and samples from the graph at startup. Tool descriptions are dynamically generated with domain context.
- Ontology graph integration — Each MCP server optionally connects to a companion ontology graph for class/property lookups.
- Layered configuration — `base:` reference support for shared config (LLM, database, embedder) with domain-specific overlays.
- Dynamic tool registration — 7 main tools registered via `mcp.add_tool()` with domain-aware descriptions; 7 secondary tools use static decorators.
- MCP resources — `graphiti://domain_summary`, `graphiti://entity_catalog`, and `graphiti://relationship_types` for client introspection.
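The `base:` reference implies a recursive deep merge of the shared file under the domain overlay. A minimal sketch of that merge (the config key names in the example are illustrative):

```python
from copy import deepcopy

def deep_merge(base: dict, overlay: dict) -> dict:
    """Overlay wins; nested dicts merge recursively, scalars and lists replace."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

# Example: a shared base (llm/database/embedder) overlaid with domain settings
shared = {"llm": {"model": "gpt-4o", "temperature": 0.0}}
domain = {"llm": {"temperature": 0.2}, "group_id": "aviation"}
config = deep_merge(shared, domain)
# config["llm"] keeps "model" from the base but takes the overlay's temperature
```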
## Test coverage
The fork adds 12+ test files with 90+ tests covering all fork-specific fixes, plus 200+ MCP server tests across 7 test files:
| Test file | Tests | Coverage |
|---|---|---|
| `test_tools.py` | Unit tests for all 15 MCP tools | Tool behavior |
| `test_config_merge.py` | 10 tests | Layered config deep merge |
| `test_domain_profile.py` | 15 tests | Graph introspection and rendering |
| `test_tool_descriptions.py` | 12 tests | Dynamic description generation |
| `test_cypher.py` | 95 tests | 4-stage pipeline, formatter, error handling |
| `test_cypher_regression.py` | Regression fixtures | Grows as edge cases are discovered |
| `test_graph_profiler.py` | 20 tests | Property profiling, language detection, relationship validation |
All tests pass alongside existing upstream tests.
## Upstream sync strategy
The fork tracks upstream manually. When Graphiti ships changes we want:
- Fetch upstream `main` into the fork's `main` branch
- Cherry-pick individual commits onto the `aletheia` branch
- Resolve conflicts and run the full test suite
This approach avoids merge noise from upstream changes we don't need while keeping the door open for selective adoption.
## Lessons learned
Several prompt engineering anti-patterns were discovered during fork development:
- "Strip qualifiers" is too aggressive — The LLM cannot reliably distinguish valuable identifiers (ICAO codes, registration numbers) from noise. Telling it to "strip contextual qualifiers" caused data loss.
- "Otherwise derive" creates non-determinism — When schema types are provided, the extraction prompt must constrain to only those types. Any escape hatch ("otherwise derive a type") causes the LLM to invent different type names between runs.
- Entity names should never embed type names — Compound names like `"SafetyRecommendation EASA-SR-2024-523-01"` confuse the LLM during edge extraction. The name should be just the identifier; the type comes from the schema.
- Metadata context must be consistent across the pipeline — Entity type descriptions visible during extraction must also be visible during deduplication, or the LLM makes inconsistent judgments.