
Understanding Embeddings and Keeping Them Consistent

This FAQ explains what embeddings are, how Graphiti uses them, and why you must use the same embedding configuration across all operations.

What Are Embeddings?

Embeddings are numerical representations of text as vectors (arrays of numbers). They capture semantic meaning, allowing computers to measure how similar two pieces of text are.

"Hamas is a terrorist organization"  →  [0.23, -0.45, 0.12, ..., 0.67]  (384 numbers)
"Hezbollah is a militant group"      →  [0.21, -0.42, 0.15, ..., 0.64]  (384 numbers)
                                     Similar vectors = Similar meaning
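"Similar direction" is usually quantified with cosine similarity. Here is a minimal pure-Python sketch; the 4-element vectors are toy stand-ins for the truncated 384-dim examples above, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.23, -0.45, 0.12, 0.67]  # toy stand-in for the first sentence's vector
v2 = [0.21, -0.42, 0.15, 0.64]  # toy stand-in for the second
print(round(cosine_similarity(v1, v2), 3))  # → 0.999 (nearly identical direction)
```

Real embedders return vectors of fixed, model-specific length; the arithmetic is the same regardless of dimension.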

Key Properties

| Property | Description |
|---|---|
| Dimensions | Number of values in the vector (e.g., 384, 768, 1024) |
| Model-specific | Different models produce different vectors for the same text |
| Semantic similarity | Similar meanings → vectors point in similar directions |
| Distance metrics | Cosine similarity measures the angle between vectors |

Common Embedding Models

| Model | Dimensions | Provider |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI |
| text-embedding-ada-002 | 1536 | OpenAI |
| BAAI/bge-small-en-v1.5 | 384 | HuggingFace (local) |
| BAAI/bge-base-en-v1.5 | 768 | HuggingFace (local) |
| all-MiniLM-L6-v2 | 384 | HuggingFace (local) |

How Graphiti Uses Embeddings

Graphiti uses embeddings at multiple stages in the knowledge graph lifecycle:

1. Ingestion (Building the Graph)

When you run aletheia build-knowledge-graph, Graphiti:

Episode text
    ▼ [Entity Extraction - LLM]
Entities & Relationships
    ▼ [Embedding Generation]
Each node and edge gets an embedding vector:
- Node: embedding of node.name + node.summary
- Edge: embedding of edge.fact
    ▼ [Storage]
Vectors stored in graph database (FalkorDB/Neo4j)

What gets embedded:

| Component | Embedded Text | Purpose |
|---|---|---|
| Entity nodes | name + summary | Find similar entities |
| Entity edges | fact | Find similar relationships |
| Episodes | content | Find similar source documents |
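In code, the table above corresponds to building one string per component before it is handed to the embedder. The helpers below are a hypothetical sketch of that composition step, not Graphiti's actual implementation:

```python
def embedding_text_for_node(name: str, summary: str) -> str:
    # Entity nodes: name and summary are combined into one string before embedding
    return f"{name} {summary}".strip()

def embedding_text_for_edge(fact: str) -> str:
    # Entity edges: the extracted fact sentence is embedded as-is
    return fact

print(embedding_text_for_node("Hamas", "Militant group operating in Gaza"))
# → Hamas Militant group operating in Gaza
```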

2. Search (Querying the Graph)

When you search the graph, Graphiti:

User query: "What terrorist groups operate in Gaza?"
    ▼ [Embedding Generation]
Query vector: [0.34, -0.21, ..., 0.56]
    ▼ [Vector Similarity Search]
Compare query vector to stored node/edge vectors
    ▼ [Results]
Return nodes and edges with highest cosine similarity
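The search step above amounts to a nearest-neighbor lookup over the stored vectors. A brute-force sketch with made-up 3-dim vectors (in practice the graph database's vector index does this, not a Python loop):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """stored: list of (node_id, vector). Returns node ids ranked by similarity."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in stored]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]

stored = [
    ("hamas",     [0.9, 0.1, 0.0]),
    ("hezbollah", [0.8, 0.2, 0.1]),
    ("red_cross", [0.0, 0.1, 0.9]),
]
print(top_k([0.85, 0.15, 0.05], stored))  # → ['hamas', 'hezbollah']
```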

3. Evaluation (RAGAS Metrics)

During evaluation, the AnswerSimilarity metric:

Generated answer: "Hamas operates primarily in Gaza"
Gold answer: "Hamas is based in the Gaza Strip"
    ▼ [Embedding Generation]
Both answers → vectors
    ▼ [Cosine Similarity]
Similarity score: 0.89 (high = good)

The Consistency Requirement

Critical Rule: You must use the same embedding model for all operations on a given knowledge graph.

┌─────────────────────────────────────────────────────────────┐
│                    MUST BE IDENTICAL                        │
├─────────────────────────────────────────────────────────────┤
│  Ingestion    →  BAAI/bge-small-en-v1.5  →  384 dims       │
│  Search       →  BAAI/bge-small-en-v1.5  →  384 dims       │
│  Evaluation   →  BAAI/bge-small-en-v1.5  →  384 dims       │
└─────────────────────────────────────────────────────────────┘

Why?

  1. Dimension mismatch: You can't compare a 384-dim vector to a 1536-dim vector
  2. Semantic mismatch: Different models encode meaning differently; even with the same dimensions, vectors from different models are incompatible
  3. Garbage results: Comparing vectors from different models produces meaningless similarity scores
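Point 1 can be made concrete with a fail-fast guard. The helper below is hypothetical (not part of Graphiti), but shows why mismatched dimensions should be an error rather than a silent wrong answer:

```python
def check_compatible(query_vec: list[float], stored_dims: int) -> None:
    """Fail fast instead of producing meaningless similarity scores."""
    if len(query_vec) != stored_dims:
        raise ValueError(
            f"Embedding dimension mismatch: query has {len(query_vec)} dims, "
            f"but the graph was built with {stored_dims}-dim vectors. "
            "Rebuild the graph or switch back to the original embedding model."
        )

check_compatible([0.0] * 384, 384)       # OK: same dimensionality
# check_compatible([0.0] * 1536, 384)    # would raise ValueError
```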

The Bug We Fixed (TODO-050)

What Happened

We ingested a knowledge graph using local embeddings (384 dimensions), but RAGAS evaluation was hardcoded to use OpenAI embeddings (1536 dimensions):

Ingestion (Graphiti):     BAAI/bge-small-en-v1.5    → 384 dims stored
Evaluation (RAGAS):       text-embedding-3-small    → 1536 dims computed
                          DIMENSION MISMATCH ERROR

Symptoms

  • RAGAS AnswerSimilarity metric failed or returned incorrect scores
  • Error messages about incompatible vector dimensions
  • Evaluation worked with OpenAI embeddings but broke with local embeddings

The Fix

We added _create_ragas_embeddings() in ragas_evaluator.py that reads the same .env configuration used by Graphiti:

# At module level in ragas_evaluator.py (required by the code below):
import os
from openai import OpenAI

def _create_ragas_embeddings(self):
    """Create RAGAS embeddings from .env configuration.

    This ensures RAGAS uses the same embedding model as Graphiti,
    avoiding dimension mismatches between stored vectors and evaluation.
    """
    embedding_provider = os.getenv("EMBEDDING_PROVIDER", "openai").lower()

    if embedding_provider == "local":
        # Imported lazily so the HuggingFace dependency stays optional
        from ragas.embeddings import HuggingFaceEmbeddings
        model_name = os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
        return HuggingFaceEmbeddings(model=model_name)

    # Default: OpenAI embeddings
    from ragas.embeddings import OpenAIEmbeddings
    return OpenAIEmbeddings(client=OpenAI())

Configuration

Setting Embedding Provider

In your .env file:

# Option 1: OpenAI embeddings (API calls, higher quality)
EMBEDDING_PROVIDER=openai

# Option 2: Local embeddings (no API, faster, free)
EMBEDDING_PROVIDER=local
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

Local Embedding Options

| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | Good | Fast |
| BAAI/bge-base-en-v1.5 | 768 | Better | Medium |
| all-MiniLM-L6-v2 | 384 | Basic | Very Fast |

Device Selection (Local Only)

# Auto-detect best available (default)
EMBEDDING_DEVICE=auto

# Force specific device
EMBEDDING_DEVICE=mps    # Apple Silicon GPU
EMBEDDING_DEVICE=cuda   # NVIDIA GPU
EMBEDDING_DEVICE=cpu    # CPU only
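Under the hood, auto typically resolves to the best backend present on the machine. The sketch below mirrors that logic but is not Graphiti's actual code; the torch probing is guarded so the function still works when torch isn't installed:

```python
import os

def resolve_device() -> str:
    """Map EMBEDDING_DEVICE to a concrete backend, defaulting to 'auto' detection."""
    choice = os.getenv("EMBEDDING_DEVICE", "auto").lower()
    if choice != "auto":
        return choice  # user forced mps/cuda/cpu explicitly
    try:
        import torch  # only needed for auto-detection
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    except (ImportError, AttributeError):
        pass
    return "cpu"  # safe fallback

os.environ["EMBEDDING_DEVICE"] = "cpu"
print(resolve_device())  # → cpu
```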

Troubleshooting

"Dimension mismatch" Error

Cause: Different embedding models used for ingestion vs search/evaluation.

Solution:

1. Check that your .env settings match what was used during ingestion.
2. If you changed embedding providers, rebuild the graph:

aletheia build-knowledge-graph \
  --use-case my_case \
  --knowledge-graph my_graph \
  --reset  # Clears and rebuilds with current embedding settings

"Vector index not found" Error

Cause: Graph was built without vector indices, or indices use different dimensions.

Solution: Rebuild indices or the entire graph with consistent settings.

Poor Search Results Despite Good Data

Cause: Possible embedding model mismatch or using wrong model for your domain.

Solutions:

1. Verify embedding consistency across all operations.
2. Try a different embedding model better suited to your domain.
3. Enable hybrid search (BM25 + vector) to catch keyword matches that semantic search misses.


Best Practices

1. Document Your Embedding Configuration

When creating a knowledge graph, record:

- Embedding provider (openai/local)
- Model name and version
- Dimensions
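One lightweight way to record this is to write a small manifest next to the graph at build time. This is an illustrative sketch; the filename and field names are hypothetical, not a Graphiti convention:

```python
import json
import tempfile
from pathlib import Path

def write_embedding_manifest(graph_dir: str, provider: str, model: str, dims: int) -> Path:
    """Record the embedding configuration a graph was built with."""
    path = Path(graph_dir) / "embedding_manifest.json"
    path.write_text(json.dumps(
        {"embedding_provider": provider, "embedding_model": model, "dimensions": dims},
        indent=2,
    ))
    return path

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as d:
    manifest = write_embedding_manifest(d, "local", "BAAI/bge-small-en-v1.5", 384)
    print(json.loads(manifest.read_text())["dimensions"])  # → 384
```

At evaluation or search time, the manifest can be compared against the current .env before any vectors are computed.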

2. Use Environment Variables Consistently

Always use .env for embedding configuration—never hardcode:

# Bad - hardcoded
embedder = OpenAIEmbedder()

# Good - from configuration
embedder = create_embedder_from_config()  # Reads .env
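The snippet above names create_embedder_from_config without showing it; a hypothetical sketch of such a factory is below. The returned tuples are stand-ins for real embedder instances, kept abstract so the sketch stays dependency-free:

```python
import os

def create_embedder_from_config():
    """Hypothetical factory: choose the embedder from .env, never hardcode it."""
    provider = os.getenv("EMBEDDING_PROVIDER", "openai").lower()
    if provider == "local":
        model = os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
        return ("local", model)  # stand-in for a HuggingFace embedder instance
    return ("openai", "text-embedding-3-small")  # stand-in for an OpenAI embedder

os.environ["EMBEDDING_PROVIDER"] = "local"
os.environ["EMBEDDING_MODEL"] = "BAAI/bge-small-en-v1.5"
print(create_embedder_from_config())  # → ('local', 'BAAI/bge-small-en-v1.5')
```

Because every component calls the same factory, ingestion, search, and evaluation can't silently drift onto different models.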

3. Consider Local Embeddings for Development

Local embeddings are:

- Free (no API costs)
- Fast (no network latency)
- Private (data never leaves your machine)

# Install once
pip install sentence-transformers

# Configure
EMBEDDING_PROVIDER=local
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

4. Test Before Large Ingestion

Before ingesting thousands of episodes, verify your embedding setup:

# Ingest a small subset
aletheia build-knowledge-graph --use-case my_case --knowledge-graph test_graph

# Run evaluation to verify everything works
aletheia evaluate-ragas --knowledge-graph test_graph --questions test_questions.json

Summary

| Rule | Why |
|---|---|
| Same model for all operations | Vectors must be comparable |
| Same dimensions | Can't compare 384-dim to 1536-dim |
| Configuration via .env | Ensures consistency across components |
| Rebuild graph if changing embeddings | Stored vectors become invalid |