Understanding Embeddings and Keeping Them Consistent¶
This FAQ explains what embeddings are, how Graphiti uses them, and why you must use the same embedding configuration across all operations.
What Are Embeddings?¶
Embeddings are numerical representations of text as vectors (arrays of numbers). They capture semantic meaning, allowing computers to measure how similar two pieces of text are.
"Hamas is a terrorist organization" → [0.23, -0.45, 0.12, ..., 0.67] (384 numbers)
"Hezbollah is a militant group" → [0.21, -0.42, 0.15, ..., 0.64] (384 numbers)
↓
Similar vectors = Similar meaning
Key Properties¶
| Property | Description |
|---|---|
| Dimensions | Number of values in the vector (e.g., 384, 768, 1024) |
| Model-specific | Different models produce different vectors for the same text |
| Semantic similarity | Similar meanings → vectors point in similar directions |
| Distance metrics | Cosine similarity measures angle between vectors |
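The cosine-similarity metric from the table can be computed directly. A minimal pure-Python sketch, using toy 4-dimensional vectors as stand-ins for real 384-dimensional embeddings:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors echoing the example above
v1 = [0.23, -0.45, 0.12, 0.67]
v2 = [0.21, -0.42, 0.15, 0.64]
print(round(cosine_similarity(v1, v2), 3))  # close to 1.0: similar meaning
```

Real vector stores compute the same quantity, just over hundreds of dimensions and millions of vectors with optimized index structures.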
Common Embedding Models¶
| Model | Dimensions | Provider |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI |
| text-embedding-ada-002 | 1536 | OpenAI |
| BAAI/bge-small-en-v1.5 | 384 | HuggingFace (local) |
| BAAI/bge-base-en-v1.5 | 768 | HuggingFace (local) |
| all-MiniLM-L6-v2 | 384 | HuggingFace (local) |
How Graphiti Uses Embeddings¶
Graphiti uses embeddings at multiple stages in the knowledge graph lifecycle:
1. Ingestion (Building the Graph)¶
When you run aletheia build-knowledge-graph, Graphiti:
Episode text
│
▼ [Entity Extraction - LLM]
Entities & Relationships
│
▼ [Embedding Generation]
Each node and edge gets an embedding vector:
- Node: embedding of node.name + node.summary
- Edge: embedding of edge.fact
│
▼ [Storage]
Vectors stored in graph database (FalkorDB/Neo4j)
What gets embedded:
| Component | Embedded Text | Purpose |
|---|---|---|
| Entity nodes | name + summary | Find similar entities |
| Entity edges | fact | Find similar relationships |
| Episodes | content | Find similar source documents |
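The table above can be sketched as a small helper that assembles the text to embed per component. The dict field names ("name", "summary", "fact", "content") mirror the table but are illustrative, not Graphiti's actual attributes:

```python
def texts_to_embed(node: dict, edge: dict, episode: dict) -> dict:
    """Assemble the text that gets an embedding for each graph component.

    A sketch of the table above; field names are illustrative,
    not Graphiti's real attribute names.
    """
    return {
        "node": f"{node['name']} {node['summary']}",  # find similar entities
        "edge": edge["fact"],                         # find similar relationships
        "episode": episode["content"],                # find similar source documents
    }

texts = texts_to_embed(
    node={"name": "Hamas", "summary": "militant group based in Gaza"},
    edge={"fact": "Hamas operates primarily in Gaza"},
    episode={"content": "Full episode source text..."},
)
```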
2. Search (Querying the Graph)¶
When you search the graph, Graphiti:
User query: "What terrorist groups operate in Gaza?"
│
▼ [Embedding Generation]
Query vector: [0.34, -0.21, ..., 0.56]
│
▼ [Vector Similarity Search]
Compare query vector to stored node/edge vectors
│
▼ [Results]
Return nodes and edges with highest cosine similarity
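The search flow above boils down to "embed the query, rank stored vectors by cosine similarity, return the top k". A dependency-free sketch with toy 3-dimensional vectors (entity names and values are made up for illustration):

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def vector_search(query_vec: list, stored: dict, top_k: int = 2) -> list:
    """Rank stored node/edge vectors by cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in stored.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

stored_vectors = {
    "Hamas": [0.9, 0.1, 0.0],
    "Gaza Strip": [0.7, 0.6, 0.1],
    "Oslo Accords": [0.0, 0.2, 0.9],
}
print(vector_search([0.8, 0.5, 0.0], stored_vectors))
```

Production systems use approximate nearest-neighbor indices rather than this linear scan, but the ranking criterion is the same.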
3. Evaluation (RAGAS Metrics)¶
During evaluation, the AnswerSimilarity metric:
Generated answer: "Hamas operates primarily in Gaza"
Gold answer: "Hamas is based in the Gaza Strip"
│
▼ [Embedding Generation]
Both answers → vectors
│
▼ [Cosine Similarity]
Similarity score: 0.89 (high = good)
The Consistency Requirement¶
Critical Rule: You must use the same embedding model for all operations on a given knowledge graph.
┌─────────────────────────────────────────────────────────────┐
│ MUST BE IDENTICAL │
├─────────────────────────────────────────────────────────────┤
│ Ingestion → BAAI/bge-small-en-v1.5 → 384 dims │
│ Search → BAAI/bge-small-en-v1.5 → 384 dims │
│ Evaluation → BAAI/bge-small-en-v1.5 → 384 dims │
└─────────────────────────────────────────────────────────────┘
Why?¶
- Dimension mismatch: you cannot compare a 384-dim vector to a 1536-dim vector; the similarity computation is undefined
- Semantic mismatch: different models encode meaning differently, so even at the same dimensionality, vectors from different models live in incompatible spaces
- Garbage results: comparing vectors from different models produces meaningless similarity scores
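The dimension-mismatch failure is easy to demonstrate: a similarity function has no meaningful answer for vectors of different lengths, so the only safe behavior is to fail loudly. A minimal sketch:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity; fails loudly on a dimension mismatch."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

stored = [0.1] * 384   # e.g. stored by BAAI/bge-small-en-v1.5
query = [0.1] * 1536   # e.g. computed by text-embedding-3-small
try:
    cosine(stored, query)
except ValueError as err:
    print(err)  # dimension mismatch: 384 vs 1536
```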
The Bug We Fixed (TODO-050)¶
What Happened¶
We ingested a knowledge graph using local embeddings (384 dimensions), but RAGAS evaluation was hardcoded to use OpenAI embeddings (1536 dimensions):
Ingestion (Graphiti): BAAI/bge-small-en-v1.5 → 384 dims stored
Evaluation (RAGAS): text-embedding-3-small → 1536 dims computed
↓
DIMENSION MISMATCH ERROR
Symptoms¶
- RAGAS AnswerSimilarity metric failed or returned incorrect scores
- Error messages about incompatible vector dimensions
- Evaluation worked with OpenAI embeddings but broke with local embeddings
The Fix¶
We added _create_ragas_embeddings() in ragas_evaluator.py that reads the same .env configuration used by Graphiti:
def _create_ragas_embeddings(self):
    """Create RAGAS embeddings from .env configuration.

    This ensures RAGAS uses the same embedding model as Graphiti,
    avoiding dimension mismatches between stored vectors and evaluation.
    """
    embedding_provider = os.getenv("EMBEDDING_PROVIDER", "openai").lower()

    if embedding_provider == "local":
        from ragas.embeddings import HuggingFaceEmbeddings

        model_name = os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
        return HuggingFaceEmbeddings(model=model_name)

    # Default: OpenAI embeddings
    from ragas.embeddings import OpenAIEmbeddings

    return OpenAIEmbeddings(client=OpenAI())
Configuration¶
Setting Embedding Provider¶
In your .env file:
# Option 1: OpenAI embeddings (API calls, higher quality)
EMBEDDING_PROVIDER=openai
# Option 2: Local embeddings (no API, faster, free)
EMBEDDING_PROVIDER=local
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
Local Embedding Options¶
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | Good | Fast |
| BAAI/bge-base-en-v1.5 | 768 | Better | Medium |
| all-MiniLM-L6-v2 | 384 | Basic | Very Fast |
Device Selection (Local Only)¶
# Auto-detect best available (default)
EMBEDDING_DEVICE=auto
# Force specific device
EMBEDDING_DEVICE=mps # Apple Silicon GPU
EMBEDDING_DEVICE=cuda # NVIDIA GPU
EMBEDDING_DEVICE=cpu # CPU only
Troubleshooting¶
"Dimension mismatch" Error¶
Cause: Different embedding models used for ingestion vs search/evaluation.
Solution:
1. Check that your .env settings match what was used during ingestion
2. If you changed embedding providers, you must rebuild the graph:
aletheia build-knowledge-graph \
--use-case my_case \
--knowledge-graph my_graph \
--reset # Clears and rebuilds with current embedding settings
"Vector index not found" Error¶
Cause: Graph was built without vector indices, or indices use different dimensions.
Solution: Rebuild indices or the entire graph with consistent settings.
Poor Search Results Despite Good Data¶
Cause: Possible embedding model mismatch or using wrong model for your domain.
Solutions:
1. Verify embedding consistency across all operations
2. Try a different embedding model better suited to your domain
3. Enable hybrid search (BM25 + vector) to catch keyword matches that semantic search misses
Best Practices¶
1. Document Your Embedding Configuration¶
When creating a knowledge graph, record:
- Embedding provider (openai/local)
- Model name and version
- Dimensions
2. Use Environment Variables Consistently¶
Always use .env for embedding configuration—never hardcode:
# Bad - hardcoded
embedder = OpenAIEmbedder()
# Good - from configuration
embedder = create_embedder_from_config() # Reads .env
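A hedged sketch of what such a factory might look like. Note that create_embedder_from_config is hypothetical, not an actual Graphiti API; the sketch returns a (provider, model) pair instead of a real embedder object so it stays dependency-free:

```python
import os

def create_embedder_from_config():
    """Hypothetical config-driven embedder factory (a sketch, not real API).

    Mirrors the _create_ragas_embeddings() fix: every component reads the
    same environment variables, so ingestion, search, and evaluation agree.
    """
    provider = os.getenv("EMBEDDING_PROVIDER", "openai").lower()
    if provider == "local":
        model = os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
        return ("local", model)   # in real code: e.g. a SentenceTransformer
    return ("openai", "text-embedding-3-small")
```

The point of the pattern is that no component hardcodes a model name; change .env once and every stage picks up the same configuration.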
3. Consider Local Embeddings for Development¶
Local embeddings are:
- Free (no API costs)
- Fast (no network latency)
- Private (data never leaves your machine)
# Install once
pip install sentence-transformers
# Configure
EMBEDDING_PROVIDER=local
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
4. Test Before Large Ingestion¶
Before ingesting thousands of episodes, verify your embedding setup:
# Ingest a small subset
aletheia build-knowledge-graph --use-case my_case --knowledge-graph test_graph
# Run evaluation to verify everything works
aletheia evaluate-ragas --knowledge-graph test_graph --questions test_questions.json
Summary¶
| Rule | Why |
|---|---|
| Same model for all operations | Vectors must be comparable |
| Same dimensions | Can't compare 384-dim to 1536-dim |
| Configuration via .env | Ensures consistency across components |
| Rebuild graph if changing embeddings | Stored vectors become invalid |