Architecture

Aletheia is a GraphRAG evaluation framework and knowledge graph builder. This document describes its architecture and how components interact.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Data Sources                             │
│              (FTM JSON, MuSiQue, Custom formats)                 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                          Use Cases                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Parser    │  │   Episode    │  │      Ontology        │  │
│  │              │──│   Builder    │  │      Loader          │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Aletheia Core                              │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐ │
│  │   Config   │ │   Graph    │ │  Schema    │ │ Evaluation  │ │
│  │  (DB, LLM) │ │  Builder   │ │ Inference  │ │   (RAGAS)   │ │
│  └────────────┘ └────────────┘ └────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                    Graphiti (Fork)                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   Episode    │  │    Entity    │  │       Search         │  │
│  │  Processing  │  │  Resolution  │  │        API           │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│  ┌──────────────┐  ┌──────────────────────────────────────────┐ │
│  │  Community   │  │           MCP Server                     │ │
│  │  Detection   │  │  (15 tools, self-describing connectors)  │ │
│  └──────────────┘  └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                      Graph Database                              │
│                    (Neo4j / FalkorDB)                            │
└─────────────────────────────────────────────────────────────────┘

Project Structure

aletheia/
├── aletheia/
│   ├── cli/                 # CLI commands
│   │   ├── main.py          # Entry point
│   │   ├── build.py         # Build commands
│   │   └── evaluate.py      # Evaluation commands
│   ├── core/
│   │   ├── config/          # Database + LLM configuration
│   │   ├── episodes/        # Episode builder registry
│   │   ├── evaluation/      # RAGAS integration + grounding verification
│   │   ├── graph/           # Graph builder
│   │   ├── ontology/        # GenericOntologyLoader, ModelingProfile
│   │   ├── parsing/         # Base parser
│   │   ├── schema/          # Schema inference engine (7 modes)
│   │   └── tracking/        # Ingestion progress
│   └── ...
├── use_cases/
│   ├── anticorruption/          # EU financial sanctions (FTM)
│   ├── terrorist_orgs/          # Multi-authority FTO designations (FTM)
│   ├── aviation_safety/         # European aviation incidents
│   ├── safety_recommendations/  # EASA safety recommendations
│   ├── airworthiness_directives/# EASA airworthiness directives
│   ├── operation_tango/         # Multi-dataset investigation (FTM)
│   └── evaluation/              # MuSiQue evaluation benchmark
├── schemas/                 # Auto-generated schemas (never edit manually)
├── prompts/                 # Dynamic extraction prompts
└── docs/                    # Documentation (MkDocs Material)

Generated files

Files in schemas/ and prompts/ are regenerated by schema inference on every run. Never edit them manually — fix the inputs (parser, ontology, inference code) instead.

Component Details

CLI Layer

The CLI (aletheia/cli/) provides commands for building graphs, running evaluations, and inspecting state:

  • main.py — Click command group registration
  • build.py — build-ontology-graph, build-knowledge-graph, list-use-cases, list-graphs, show-graph
  • evaluate.py — evaluate-ragas with grounding modes and community search

Core Layer

Config

Dual LLM configuration (reasoning model + fast model), database driver creation, embedding model setup.
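The dual-model split might be sketched as follows; the class and field names here are illustrative assumptions, not Aletheia's actual config schema:

```python
from dataclasses import dataclass

# Hypothetical shape of the dual-LLM configuration: a slow, high-quality
# reasoning model for extraction, a fast model for cheap classification
# work, plus the embedding model used for semantic search.
@dataclass
class LLMConfig:
    reasoning_model: str   # e.g. extraction and schema inference
    fast_model: str        # e.g. deduplication and classification
    embedding_model: str   # vector embeddings for search

cfg = LLMConfig(
    reasoning_model="reasoning-large",
    fast_model="fast-small",
    embedding_model="embed-base",
)
```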

Schema Inference

The schema inference engine (aletheia/core/schema/) supports 7 modes and a common Phase 4 consolidation step:

Mode              Primary Source              LLM Role
none              Graphiti defaults           None
llm / inference   LLM                         Full inference
ontology          Ontology file               None
hybrid            LLM + ontology validation   Inference + validation
graph-hybrid      LLM + ontology graph        Inference + semantic alignment
ontology-first    Ontology + LLM              Enhancement only

Key components:

  • inference.py — Main engine: mode dispatch, Phase 4 consolidation, data-driven pruning
  • models.py — EntityTypeDefinition, RelationshipTypeDefinition, enriched docstrings, edge type map with ("Entity", "Entity") catch-all
  • coercion.py — CoerciveBaseModel that fixes LLM scalar/list type mismatches
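The scalar/list repair can be sketched as a small helper; this is an illustrative sketch of the idea behind CoerciveBaseModel, not the actual class, which builds on the project's model base:

```python
from typing import get_origin

# Hypothetical helper: repair the scalar/list type mismatches that LLM
# structured output commonly produces before model validation runs.
def coerce(expected_type, value):
    """Wrap a stray scalar into a list, or unwrap a stray list to a scalar."""
    want_list = get_origin(expected_type) is list
    if want_list and not isinstance(value, list):
        return [value]                        # scalar where a list was expected
    if not want_list and isinstance(value, list):
        return value[0] if value else None    # list where a scalar was expected
    return value
```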

Ontology

  • GenericOntologyLoader — Loads TTL/OWL ontologies, classifies classes using transitive ancestry (Entity, Relationship, Abstract), extracts non-reified object properties as relationship types
  • ModelingProfile — Optional explicit classification hints per ontology

Evaluation

  • RAGAS metrics — Context Precision, Context Recall, Faithfulness, Answer Similarity
  • Grounding verification — Three modes (strict, lenient, off) to detect parametric knowledge leakage
  • Community search — Optional hierarchical context from entity clusters built via label propagation

Graph Builder

Orchestrates the ingestion pipeline: parse → build episodes → infer schema → call Graphiti add_episode. Supports --build-communities for community detection and --resume for resuming interrupted ingestion.
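The pipeline shape can be sketched as a loop; the class and function names below are assumptions for illustration, not the actual Aletheia or Graphiti API:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    name: str

@dataclass
class Tracker:
    """Stands in for the tracking component that enables --resume."""
    done: set = field(default_factory=set)
    def is_done(self, eid: str) -> bool: return eid in self.done
    def mark_done(self, eid: str) -> None: self.done.add(eid)

def ingest(entities, episode_builder, add_episode, tracker):
    """parse -> build episode -> add_episode, skipping finished work on resume."""
    for entity in entities:
        if tracker.is_done(entity.id):         # --resume: already ingested
            continue
        add_episode(episode_builder(entity))   # Graphiti extracts + resolves
        tracker.mark_done(entity.id)
```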

Use Case Layer

Each use case (use_cases/<name>/) is self-contained:

use_cases/terrorist_orgs/
├── __init__.py              # Registration
├── parser.py                # Data parser
├── episode_builder.py       # Markdown episode builder
├── ontology/                # TTL/OWL files
│   └── followthemoney.ttl
├── data/                    # Source data
│   └── entities.ftm.json
├── evaluation_questions.json # RAGAS evaluation questions
└── mcp_config.yaml          # MCP server configuration

Graphiti Integration

Aletheia uses a maintained fork of Graphiti (david-morales/aletheia-graphiti, branch: aletheia) that includes 16 cherry-picked upstream PRs and 12 custom fixes for entity extraction, node dedup, and edge resolution.

Graphiti handles:

  1. Episode processing — Text → entities + relationships
  2. Entity resolution — Deduplication and merging
  3. Search API — Semantic and graph-based search (BFS, cosine similarity, community)
  4. Community detection — Label propagation clustering with hierarchical summaries

MCP Server

The Graphiti fork includes an MCP server with 15 tools across 6 groups:

Group                    Tools
Semantic Discovery       search, explore_node
Schema & Ontology        get_schema, search_ontology, explore_ontology
Graph Profiling          profile_graph (property coverage, language detection, relationship validation)
Cypher Analytics         run_cypher (read-only, 4-stage security pipeline)
Community Intelligence   build_communities
Data Management          add_memory, get_episodes, get_episode_context, delete_entity_edge, delete_episode, clear_graph, get_status
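As a flavor of the read-only enforcement in run_cypher, one stage might be a write-clause filter like the sketch below; this is a hypothetical first-pass check only, and does not reproduce the fork's actual 4-stage pipeline:

```python
import re

# Hypothetical stage of a read-only guard: reject queries containing
# Cypher write clauses before they reach the database.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)

def looks_read_only(query: str) -> bool:
    """True if no write clause appears; later stages would validate further."""
    return WRITE_CLAUSES.search(query) is None
```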

Each connector is self-describing: a DomainProfile auto-discovers entity types, edge types, counts, and samples at startup, generating domain-specific tool descriptions and MCP resources. Four domain configs are defined in use_cases/<name>/mcp_config.yaml with a shared base config.
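The self-describing mechanism might look like the sketch below; the field and method names are illustrative assumptions, not the fork's actual DomainProfile API:

```python
from dataclasses import dataclass

# Hypothetical profile: facts discovered from the graph at startup are
# folded into the tool descriptions the MCP server advertises.
@dataclass
class DomainProfile:
    name: str
    entity_types: list[str]
    edge_types: list[str]
    node_count: int

    def tool_description(self) -> str:
        """Render a domain-specific description for a generic search tool."""
        return (
            f"Search the {self.name} graph: {self.node_count} nodes, "
            f"entity types [{', '.join(self.entity_types)}], "
            f"edge types [{', '.join(self.edge_types)}]."
        )
```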

Data Flow

Ingestion

1. Parser.parse()                → Iterator[Entity]
2. episode_builder(entity)       → markdown text
3. SchemaInferenceEngine.extract → entity_types, edge_types (mode-specific + Phase 4)
4. graphiti.add_episode(...)     → graph updates (entities, edges, communities)
5. Tracking records progress     → resume support

Evaluation

1. Load questions from JSON
2. For each question:
   a. graphiti.search_(query)    → context (nodes + edges + communities)
   b. LLM generates answer       → grounded response with citations
   c. Grounding verification     → accept/reject based on evidence
   d. RAGAS metrics calculation  → precision, recall, faithfulness, similarity
3. Aggregate and output results
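The per-question loop above can be sketched as follows, with callables standing in for Graphiti search, the answering LLM, grounding verification, and RAGAS scoring; all names here are illustrative:

```python
# Hypothetical evaluation loop: retrieve context, generate an answer,
# reject ungrounded answers, then score with the injected metric function.
def evaluate(questions, search, answer, is_grounded, score):
    results = []
    for q in questions:
        context = search(q["question"])            # nodes + edges (+ communities)
        response = answer(q["question"], context)  # grounded answer with citations
        if not is_grounded(response, context):     # grounding verification
            response = None                        # reject: evidence missing
        results.append(score(q, response, context))
    return results
```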

Key Interfaces

Parser

from pathlib import Path
from typing import Iterator, Protocol

class Parser(Protocol):
    def __init__(self, data_dir: Path): ...
    def parse(self) -> Iterator[Entity]: ...
    @property
    def schema_distribution(self) -> dict[str, int]:
        """Entity type counts in data — drives data-driven pruning."""
        ...
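A minimal concrete parser satisfying this protocol might look like the sketch below; the JSON-lines layout and the "id"/"schema" field names are illustrative assumptions, not a real Aletheia data format:

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator

class JsonLinesParser:
    """Hypothetical parser over one JSON-lines file of entity records."""

    def __init__(self, data_dir: Path):
        self.data_dir = data_dir

    def parse(self) -> Iterator[dict]:
        for line in (self.data_dir / "entities.jsonl").read_text().splitlines():
            if line.strip():
                yield json.loads(line)

    @property
    def schema_distribution(self) -> dict[str, int]:
        """Entity-type counts over the data, as used for data-driven pruning."""
        return dict(Counter(e["schema"] for e in self.parse()))
```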

Episode Builder

def episode_builder(entity: Entity) -> str:
    """Convert entity to markdown episode."""
    ...

register_episode_builder(
    "my_case",
    episode_builder,
    source_description="Description of data source",
)

Ontology Loader

from pathlib import Path
from typing import Protocol

class OntologyLoader(Protocol):
    def __init__(self, ontology_dir: Path): ...
    def load(self) -> Ontology: ...

Learn More