Architecture

Aletheia is a GraphRAG evaluation framework and knowledge graph builder. This document describes its architecture and how components interact.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Data Sources                             │
│              (FTM JSON, MuSiQue, Custom formats)                 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                          Use Cases                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Parser    │  │   Episode    │  │      Ontology        │  │
│  │              │──│   Builder    │  │      Loader          │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Aletheia Core                              │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐ │
│  │   Config   │ │   Graph    │ │  Schema    │ │ Evaluation  │ │
│  │  (DB, LLM) │ │  Builder   │ │ Inference  │ │   (RAGAS)   │ │
│  └────────────┘ └────────────┘ └────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                    Graphiti (Fork)                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   Episode    │  │    Entity    │  │       Search         │  │
│  │  Processing  │  │  Resolution  │  │        API           │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│  ┌──────────────┐  ┌──────────────────────────────────────────┐ │
│  │  Community   │  │           MCP Server                     │ │
│  │  Detection   │  │  (15 tools, self-describing connectors)  │ │
│  └──────────────┘  └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                      Graph Database                              │
│                    (Neo4j / FalkorDB)                            │
└─────────────────────────────────────────────────────────────────┘

Project Structure

aletheia/
├── aletheia/
│   ├── cli/                 # CLI commands
│   │   ├── main.py          # Entry point
│   │   ├── build.py         # Build commands
│   │   └── evaluate.py      # Evaluation commands
│   ├── core/
│   │   ├── config/          # Database + LLM configuration
│   │   ├── episodes/        # Episode builder registry
│   │   ├── evaluation/      # RAGAS integration + grounding verification
│   │   ├── graph/           # Graph builder
│   │   ├── ontology/        # GenericOntologyLoader, ModelingProfile
│   │   ├── parsing/         # Base parser
│   │   ├── schema/          # Schema inference engine (7 modes)
│   │   └── tracking/        # Ingestion progress
│   └── ...
├── use_cases/
│   ├── anticorruption/          # EU financial sanctions (FTM)
│   ├── terrorist_orgs/          # Multi-authority FTO designations (FTM)
│   ├── aviation_safety/         # European aviation incidents
│   ├── safety_recommendations/  # EASA safety recommendations
│   ├── airworthiness_directives/# EASA airworthiness directives
│   ├── operation_tango/         # Multi-dataset investigation (FTM)
│   └── evaluation/              # MuSiQue evaluation benchmark
├── schemas/                 # Auto-generated schemas (never edit manually)
├── prompts/                 # Dynamic extraction prompts
└── docs/                    # Documentation (MkDocs Material)

Generated files

Files in schemas/ and prompts/ are regenerated by schema inference on every run. Never edit them manually — fix the inputs (parser, ontology, inference code) instead.

Component Details

CLI Layer

The CLI (aletheia/cli/) provides commands for building graphs, running evaluations, and inspecting state:

  • main.py — Click command group registration
  • build.py — build-ontology-graph, build-knowledge-graph, list-use-cases, list-graphs, show-graph
  • evaluate.py — evaluate-ragas with grounding modes and community search

Core Layer

Config

Dual LLM configuration (reasoning model + fast model), database driver creation, embedding model setup.
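The dual-model split might be sketched as follows; the class and field names here are illustrative assumptions, not Aletheia's actual config schema:

```python
from dataclasses import dataclass

# Hypothetical shape of the dual-LLM configuration: a slow, high-quality
# reasoning model for extraction, a fast model for cheap classification
# work, plus the embedding model used for semantic search.
@dataclass
class LLMConfig:
    reasoning_model: str   # e.g. extraction and schema inference
    fast_model: str        # e.g. deduplication and classification
    embedding_model: str   # vector embeddings for search

cfg = LLMConfig(
    reasoning_model="reasoning-large",
    fast_model="fast-small",
    embedding_model="embed-base",
)
```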

Schema Inference

The schema inference engine (aletheia/core/schema/) supports 7 modes and a common Phase 4 consolidation step:

Mode              Primary Source              LLM Role
none              Graphiti defaults           None
llm / inference   LLM                         Full inference
ontology          Ontology file               None
hybrid            LLM + ontology validation   Inference + validation
graph-hybrid      LLM + ontology graph        Inference + semantic alignment
ontology-first    Ontology + LLM              Enhancement only

Key components:

  • inference.py — Main engine: mode dispatch, Phase 4 consolidation, data-driven pruning
  • models.py — EntityTypeDefinition, RelationshipTypeDefinition, enriched docstrings, edge type map with ("Entity", "Entity") catch-all
  • coercion.py — CoerciveBaseModel that fixes LLM scalar/list type mismatches
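The scalar/list repair can be sketched as a small helper; this is an illustrative sketch of the idea behind CoerciveBaseModel, not the actual class, which builds on the project's model base:

```python
from typing import get_origin

# Hypothetical helper: repair the scalar/list type mismatches that LLM
# structured output commonly produces before model validation runs.
def coerce(expected_type, value):
    """Wrap a stray scalar into a list, or unwrap a stray list to a scalar."""
    want_list = get_origin(expected_type) is list
    if want_list and not isinstance(value, list):
        return [value]                        # scalar where a list was expected
    if not want_list and isinstance(value, list):
        return value[0] if value else None    # list where a scalar was expected
    return value
```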

Ontology

  • GenericOntologyLoader — Loads TTL/OWL ontologies, classifies classes using transitive ancestry (Entity, Relationship, Abstract), extracts non-reified object properties as relationship types
  • ModelingProfile — Optional explicit classification hints per ontology

Evaluation

  • RAGAS metrics — Context Precision, Context Recall, Faithfulness, Answer Similarity
  • Grounding verification — Three modes (strict, lenient, off) to detect parametric knowledge leakage
  • Community search — Optional hierarchical context from entity clusters built via label propagation

Graph Builder

Orchestrates the ingestion pipeline: parse → build episodes → infer schema → call Graphiti add_episode. Supports --build-communities for community detection and --resume for resuming interrupted ingestion.
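The pipeline shape can be sketched as a loop; the class and function names below are assumptions for illustration, not the actual Aletheia or Graphiti API:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    name: str

@dataclass
class Tracker:
    """Stands in for the tracking component that enables --resume."""
    done: set = field(default_factory=set)
    def is_done(self, eid: str) -> bool: return eid in self.done
    def mark_done(self, eid: str) -> None: self.done.add(eid)

def ingest(entities, episode_builder, add_episode, tracker):
    """parse -> build episode -> add_episode, skipping finished work on resume."""
    for entity in entities:
        if tracker.is_done(entity.id):         # --resume: already ingested
            continue
        add_episode(episode_builder(entity))   # Graphiti extracts + resolves
        tracker.mark_done(entity.id)
```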

Use Case Layer

Each use case (use_cases/<name>/) is self-contained:

use_cases/terrorist_orgs/
├── __init__.py              # Registration
├── parser.py                # Data parser
├── episode_builder.py       # Markdown episode builder
├── ontology/                # TTL/OWL files
│   └── followthemoney.ttl
├── data/                    # Source data
│   └── entities.ftm.json
├── evaluation_questions.json # RAGAS evaluation questions
└── mcp_config.yaml          # MCP server configuration

Graphiti Integration

Aletheia uses a maintained fork of Graphiti (david-morales/aletheia-graphiti, branch: aletheia) that includes 16 cherry-picked upstream PRs and 12 custom fixes for entity extraction, node dedup, and edge resolution.

Graphiti handles:

  1. Episode processing — Text → entities + relationships
  2. Entity resolution — Deduplication and merging
  3. Search API — Semantic and graph-based search (BFS, cosine similarity, community)
  4. Community detection — Label propagation clustering with hierarchical summaries

MCP Server

The Graphiti fork includes an MCP server with 15 tools across 6 groups:

Group                    Tools
Semantic Discovery       search, explore_node
Schema & Ontology        get_schema, search_ontology, explore_ontology
Graph Profiling          profile_graph (property coverage, language detection, relationship validation)
Cypher Analytics         run_cypher (read-only, 4-stage security pipeline)
Community Intelligence   build_communities
Data Management          add_memory, get_episodes, get_episode_context, delete_entity_edge, delete_episode, clear_graph, get_status
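As a flavor of the read-only enforcement in run_cypher, one stage might be a write-clause filter like the sketch below; this is a hypothetical first-pass check only, and does not reproduce the fork's actual 4-stage pipeline:

```python
import re

# Hypothetical stage of a read-only guard: reject queries containing
# Cypher write clauses before they reach the database.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)

def looks_read_only(query: str) -> bool:
    """True if no write clause appears; later stages would validate further."""
    return WRITE_CLAUSES.search(query) is None
```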

Each connector is self-describing: a DomainProfile auto-discovers entity types, edge types, counts, and samples at startup, generating domain-specific tool descriptions and MCP resources. Four domain configs are defined in use_cases/<name>/mcp_config.yaml with a shared base config.
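The self-describing mechanism might look like the sketch below; the field and method names are illustrative assumptions, not the fork's actual DomainProfile API:

```python
from dataclasses import dataclass

# Hypothetical profile: facts discovered from the graph at startup are
# folded into the tool descriptions the MCP server advertises.
@dataclass
class DomainProfile:
    name: str
    entity_types: list[str]
    edge_types: list[str]
    node_count: int

    def tool_description(self) -> str:
        """Render a domain-specific description for a generic search tool."""
        return (
            f"Search the {self.name} graph: {self.node_count} nodes, "
            f"entity types [{', '.join(self.entity_types)}], "
            f"edge types [{', '.join(self.edge_types)}]."
        )
```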

Data Flow

Ingestion

1. Parser.parse()                → Iterator[Entity]
2. episode_builder(entity)       → markdown text
3. SchemaInferenceEngine.extract → entity_types, edge_types (mode-specific + Phase 4)
4. graphiti.add_episode(...)     → graph updates (entities, edges, communities)
5. Tracking records progress     → resume support

Evaluation

1. Load questions from JSON
2. For each question:
   a. graphiti.search_(query)    → context (nodes + edges + communities)
   b. LLM generates answer       → grounded response with citations
   c. Grounding verification     → accept/reject based on evidence
   d. RAGAS metrics calculation  → precision, recall, faithfulness, similarity
3. Aggregate and output results
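The per-question loop above can be sketched as follows, with callables standing in for Graphiti search, the answering LLM, grounding verification, and RAGAS scoring; all names here are illustrative:

```python
# Hypothetical evaluation loop: retrieve context, generate an answer,
# reject ungrounded answers, then score with the injected metric function.
def evaluate(questions, search, answer, is_grounded, score):
    results = []
    for q in questions:
        context = search(q["question"])            # nodes + edges (+ communities)
        response = answer(q["question"], context)  # grounded answer with citations
        if not is_grounded(response, context):     # grounding verification
            response = None                        # reject: evidence missing
        results.append(score(q, response, context))
    return results
```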

Key Interfaces

Parser

from pathlib import Path
from typing import Iterator, Protocol

class Parser(Protocol):
    def __init__(self, data_dir: Path): ...
    def parse(self) -> Iterator[Entity]: ...
    @property
    def schema_distribution(self) -> dict[str, int]:
        """Entity type counts in data — drives data-driven pruning."""
        ...
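A minimal concrete parser satisfying this protocol might look like the sketch below; the JSON-lines layout and the "id"/"schema" field names are illustrative assumptions, not a real Aletheia data format:

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator

class JsonLinesParser:
    """Hypothetical parser over one JSON-lines file of entity records."""

    def __init__(self, data_dir: Path):
        self.data_dir = data_dir

    def parse(self) -> Iterator[dict]:
        for line in (self.data_dir / "entities.jsonl").read_text().splitlines():
            if line.strip():
                yield json.loads(line)

    @property
    def schema_distribution(self) -> dict[str, int]:
        """Entity-type counts over the data, as used for data-driven pruning."""
        return dict(Counter(e["schema"] for e in self.parse()))
```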

Episode Builder

def episode_builder(entity: Entity) -> str:
    """Convert entity to markdown episode."""
    ...

register_episode_builder(
    "my_case",
    episode_builder,
    source_description="Description of data source",
)

Ontology Loader

from pathlib import Path
from typing import Protocol

class OntologyLoader(Protocol):
    def __init__(self, ontology_dir: Path): ...
    def load(self) -> Ontology: ...

Learn More