Deduplication & Normalization

Why deduplication is the central problem

When a knowledge-graph engine ingests real-world data, the same concept shows up in many forms. "España", "ESPAÑA", "Spain", "Reino de España" and "ES" can all refer to the same entity. If the system creates one node per variant, the graph fragments: queries return partial results, relationships scatter across duplicates, and knowledge quality degrades with every episode ingested.

Aletheia delegates entity and relationship extraction to an LLM. That gives it extraordinary semantic understanding, but introduces a fundamental problem: the LLM is non-deterministic. The same entity can be extracted with slightly different names across episodes, and without a robust deduplication pipeline the graph accumulates duplicates at scale.

This page documents how Aletheia approaches deduplication and normalization — the challenges it solves and the techniques it uses.

Pipeline architecture

The deduplication pipeline operates at two independent levels: nodes (entities) and edges (relationships/facts). Both follow a multi-pass strategy that combines deterministic methods with escalation to the LLM.

Ingestion flow

When an episode is ingested via add_episode(), the sequence is:

  1. Entity extraction (LLM)
  2. Entity deduplication (deterministic + LLM)
  3. Relationship extraction (LLM, using the already-deduplicated entities)
  4. Relationship deduplication (deterministic + LLM)
  5. Attribute extraction (LLM)
  6. Persistence to FalkorDB

The order is critical: entities are deduplicated before relationships are extracted. This guarantees that relationships reference canonical nodes, not transient duplicates.

Entity (node) deduplication

Name normalization

Every name comparison starts with normalization. Aletheia uses two normalization functions with different levels of aggressiveness:

Exact normalization (_normalize_string_exact) — lowercases and collapses runs of whitespace. Used for the first exact-match pass and for fact-deduplication keys on edges.

import re

# "JOHN   SMITH" -> "john smith"
# "   Madrid   " -> "madrid"
normalized = re.sub(r'\s+', ' ', name.lower()).strip()

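Fuzzy normalization (_normalize_name_for_fuzzy) starts from the exact-normalized name and replaces every character other than lowercase ASCII letters, digits, apostrophes and spaces with a space, then collapses the resulting whitespace. Used to generate shingles (n-grams) for MinHash fuzzy matching.

# "John-Smith III" -> "john smith iii"
# "O'Brien (Jr.)" -> "o'brien jr"
# split()/join() collapses the extra spaces left where punctuation was removed
normalized = ' '.join(re.sub(r"[^a-z0-9' ]", ' ', exact_normalized).split())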

Three-pass strategy

Node deduplication applies three methods in sequence, from cheapest to most expensive.

Pass 1 — Exact match. After normalizing the name (lowercase + whitespace collapse), Aletheia looks for a literal match against existing entities. If "madrid" already exists and the extracted entity normalizes to "madrid", they merge with no further analysis.

A key design point: exact match always runs first, before any entropy or length filter. This guarantees that short names like "Spain"/"SPAIN" resolve immediately by literal match, with no need to escalate to the LLM.
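
As an illustration of pass 1, here is a minimal sketch of an exact-match lookup keyed on the normalized name. The function names and the dictionary-based index are assumptions made for this example; they are not Aletheia's internal API.

```python
import re
from typing import Optional


def normalize_exact(name: str) -> str:
    """Lowercase and collapse whitespace runs, as in the exact normalization above."""
    return re.sub(r"\s+", " ", name.lower()).strip()


def build_exact_index(existing_names: list[str]) -> dict[str, str]:
    """Map each normalized name to the first canonical name seen for it."""
    index: dict[str, str] = {}
    for name in existing_names:
        index.setdefault(normalize_exact(name), name)
    return index


def resolve_exact(name: str, index: dict[str, str]) -> Optional[str]:
    """Return the canonical name on a literal normalized match, else None."""
    return index.get(normalize_exact(name))


index = build_exact_index(["Spain", "Aeropuerto Madrid-Barajas"])
print(resolve_exact("SPAIN", index))     # Spain
print(resolve_exact("  spain ", index))  # Spain
print(resolve_exact("España", index))    # None: not a literal match, later passes decide
```

Because the lookup runs before any length or entropy check, a five-character name is resolved as cheaply as a long one.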

Pass 2 — Fuzzy similarity (MinHash + LSH). For names that don't match exactly, Aletheia uses MinHash-based Locality-Sensitive Hashing:

  1. Shingle generation — the normalized name is broken into character trigrams. "aeropuerto madrid barajas" produces {"aer", "ero", "rop", "opu", "pue", …}.
  2. MinHash signature — 32 hash permutations over the shingles produce a compact signature of the name.
  3. LSH bands — the signature is split into bands of 4 elements. If two names share at least one identical band, they are duplicate candidates.
  4. Validation — for each candidate, the real Jaccard similarity between shingle sets is computed. A 95 % threshold is required (_FUZZY_JACCARD_THRESHOLD = 0.95).
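  5. Edit-distance guard: even above the Jaccard threshold, the Levenshtein distance must be at most 2 characters. This limits false merges between long names whose shingle sets overlap heavily but that differ in several characters. It does not protect identifiers that differ by only one or two characters (e.g., "Report-001" vs "Report-002", edit distance 1); those cases are handled by the identifier_name flag described later on this page.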
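
The five steps above can be sketched with the standard library alone. This is an illustrative approximation, not Aletheia's code: the helper names are invented, and seeded BLAKE2b hashes stand in for the 32 hash permutations.

```python
import hashlib


def shingles(name: str, n: int = 3) -> set[str]:
    """Character n-grams of an already fuzzy-normalized name."""
    if len(name) <= n:
        return {name} if name else set()
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def minhash_signature(sh: set[str], permutations: int = 32) -> tuple[int, ...]:
    """Minimum of a seeded 64-bit hash per slot (sh must be non-empty)."""
    def h(seed: int, s: str) -> int:
        salt = seed.to_bytes(8, "little")
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little")
    return tuple(min(h(seed, s) for s in sh) for seed in range(permutations))


def lsh_bands(sig: tuple[int, ...], band_size: int = 4) -> set:
    """Split the signature into (band_start, band_values) keys."""
    return {(i, sig[i:i + band_size]) for i in range(0, len(sig), band_size)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_fuzzy_duplicate(a: str, b: str, jaccard_threshold: float = 0.95,
                       max_edit_distance: int = 2) -> bool:
    """Candidate via a shared LSH band, then validate with Jaccard and edit distance."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    if not (lsh_bands(minhash_signature(sa)) & lsh_bands(minhash_signature(sb))):
        return False  # no shared band: never becomes a candidate
    return jaccard(sa, sb) >= jaccard_threshold and levenshtein(a, b) <= max_edit_distance


print(is_fuzzy_duplicate("report 001", "report 002"))          # False (Jaccard is 7/9)
print(is_fuzzy_duplicate("20260000100001", "20260000100000"))  # True
```

The second pair is worth noting: both codes contain exactly the same set of trigrams, so the Jaccard similarity is 1.0 and the edit distance is 1. That is the failure mode the identifier_name flag, described later on this page, is meant to rule out.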

Entropy gate — before the fuzzy pass, a filter excludes names with Shannon entropy below 1.5 bits, fewer than 6 characters, or fewer than 2 tokens. Short or repetitive names (e.g., "A", "USA") have unreliable MinHash signatures and are escalated straight to the LLM.
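
A minimal sketch of such a gate, using the thresholds listed on this page. Two details are assumptions of the example rather than documented behavior: entropy is computed over characters with spaces ignored, and any single failed check excludes the name.

```python
import math
from collections import Counter

NAME_ENTROPY_THRESHOLD = 1.5  # bits
MIN_NAME_LENGTH = 6           # characters
MIN_TOKEN_COUNT = 2           # whitespace-separated tokens


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits, ignoring spaces (assumption)."""
    chars = text.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    return -sum((c / total) * math.log2(c / total) for c in Counter(chars).values())


def passes_entropy_gate(normalized_name: str) -> bool:
    """True when the name is long and varied enough for MinHash to be reliable."""
    if len(normalized_name) < MIN_NAME_LENGTH:
        return False
    if len(normalized_name.split()) < MIN_TOKEN_COUNT:
        return False
    return shannon_entropy(normalized_name) >= NAME_ENTROPY_THRESHOLD


for name in ["usa", "spain", "aaaa aaaa", "aeropuerto madrid barajas"]:
    print(name, passes_entropy_gate(name))
```

In this sketch "usa" and "spain" fail on length, "aaaa aaaa" fails on entropy (0 bits), and only the airport name reaches the fuzzy pass.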

System constants:

| Parameter | Value | Purpose |
| --- | --- | --- |
| _NAME_ENTROPY_THRESHOLD | 1.5 bits | Minimum entropy for fuzzy matching |
| _MIN_NAME_LENGTH | 6 chars | Minimum length for fuzzy matching |
| _MIN_TOKEN_COUNT | 2 tokens | Minimum tokens for fuzzy matching |
| _FUZZY_JACCARD_THRESHOLD | 0.95 | Minimum n-gram similarity |
| _MAX_EDIT_DISTANCE_FOR_AUTO_MERGE | 2 | Maximum edit distance |
| _MINHASH_PERMUTATIONS | 32 | MinHash signature size |
| _MINHASH_BAND_SIZE | 4 | LSH band width |

Pass 3 — LLM escalation. Entities resolved by neither exact nor fuzzy matching are presented to the LLM for a semantic decision. The prompt includes conversational context (previous episodes and current message), the extracted entity (name, type, type description), and the existing entities (name, type, type description, attributes). The key instruction:

"Entities should only be considered duplicates if they refer to the same real-world object or concept. Use the entity type description to understand which properties or identifiers define each type. If two entities of the same type share a unique identifier described in the type definition (e.g., a code, registration number or ID), they probably refer to the same entity even if their names differ."

Importantly, the prompt provides the entity type description for both the extracted and the existing entities. This symmetric context lets the LLM compare the type definitions on both sides before deciding.

As an extra robustness mechanism, when the LLM returns a duplicate name that doesn't exactly match any existing entity (e.g., it returns "Barajas" when the existing entity is "Aeropuerto Madrid-Barajas"), Aletheia applies a containment fallback: if the returned name is contained in an existing entity's name, the match is accepted.
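
The containment fallback could look roughly like the sketch below. The function name is hypothetical, and rejecting ambiguous matches (a returned name contained in more than one existing entity) is an extra safety assumption of this example, not something the text above specifies.

```python
from typing import Optional


def resolve_llm_duplicate_name(returned_name: str,
                               existing_names: list[str]) -> Optional[str]:
    """Map the name returned by the LLM back to one existing entity name."""
    target = returned_name.strip().lower()
    if not target:
        return None
    # First preference: a case-insensitive literal match.
    for name in existing_names:
        if name.lower() == target:
            return name
    # Fallback: the returned name is a substring of an existing name.
    contained = [name for name in existing_names if target in name.lower()]
    # Assumption: accept only an unambiguous containment match.
    return contained[0] if len(contained) == 1 else None


existing = ["Aeropuerto Madrid-Barajas", "Ministry of Defence"]
print(resolve_llm_duplicate_name("Barajas", existing))  # Aeropuerto Madrid-Barajas
print(resolve_llm_duplicate_name("Lisbon", existing))   # None
```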

Bulk resolution

For mass ingestion (add_episode_bulk()), Aletheia adds an extra intra-batch deduplication layer:

  1. Each episode is resolved in parallel against the existing graph.
  2. The resolved nodes from all episodes are compared against each other using deterministic similarity heuristics.
  3. A UUID map is built as a Union-Find structure with path compression, collapsing transitive chains (if A=B and B=C, then A=C).
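
Step 3 can be illustrated with a small Union-Find over UUID strings. The class and the sample UUIDs are hypothetical; the sketch only shows how path compression collapses transitive duplicate chains into a single canonical UUID.

```python
class UnionFind:
    """Disjoint-set structure over hashable keys, with path compression."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, duplicate: str, canonical: str) -> None:
        """Record that `duplicate` resolves to the same entity as `canonical`."""
        root_dup, root_canon = self.find(duplicate), self.find(canonical)
        if root_dup != root_canon:
            self.parent[root_dup] = root_canon


# Pairs produced by intra-batch comparison: A duplicates B, B duplicates C.
duplicate_pairs = [("uuid-a", "uuid-b"), ("uuid-b", "uuid-c")]

uf = UnionFind()
for dup, canon in duplicate_pairs:
    uf.union(dup, canon)

uuid_map = {uuid: uf.find(uuid) for uuid in list(uf.parent)}
print(uuid_map)  # every UUID maps to 'uuid-c'
```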

EntityLockManager — for concurrent standard-quality ingestion, Aletheia implements a per-entity lock manager that coordinates deduplication across parallel workers. Without it, two workers could create duplicates of the same node simultaneously before either sees the other's node.

Relationship (edge) deduplication

Intra-episode

Before any semantic analysis, duplicate relationships within the same episode are removed via a composite key:

key = (source_node_uuid, target_node_uuid, _normalize_string_exact(edge.fact))

If two extracted relationships share source, target and normalized fact, only the first is kept.

Across extraction chunks

When an episode contains many entities, relationship extraction is split into chunks to respect the LLM's context limits (covering chunks guarantee every entity pair is processed). Facts extracted from overlapping chunks are deduplicated with an analogous composite key, built from entity names rather than node UUIDs:

# chunk_results: one list of extracted edges per chunk
edges_data = []
seen_facts: set[tuple[str, str, str]] = set()
for chunk_edges in chunk_results:
    for edge in chunk_edges:
        # Same source, same target and same normalized fact text => same fact
        key = (edge.source_entity_name, edge.target_entity_name,
               _normalize_string_exact(edge.fact))
        if key not in seen_facts:
            seen_facts.add(key)
            edges_data.append(edge)

Against the existing graph (LLM)

For each extracted relationship, Aletheia retrieves two sets of existing relationships from the graph and presents them to the LLM as separate lists:

  • EXISTING FACTS: facts with the same source/target nodes (duplicate candidates)
  • FACT INVALIDATION CANDIDATES: semantically related facts (contradiction candidates)

The LLM responds with duplicate_facts (indices of semantically identical existing facts) and contradicted_facts (indices of facts the new one contradicts or invalidates). This two-list separation is deliberate: it lets the LLM distinguish between "this fact already exists" and "this fact makes an earlier fact obsolete" — semantically very different decisions.

Equivalence rules in the prompt:

| Pattern | Treatment |
| --- | --- |
| Active/passive voice | Equivalent: "A owns B" = "B is property of A" |
| Numeric format | Equivalent: "$6 billion" = "$6,000,000,000" = "6B USD" |
| Known aliases | Equivalent: when "al-Shabaab" and "Harakat al-Shabaab al-Mujahideen" are known to be the same entity |
| Numeric differences | Not duplicates: different numeric values imply different facts |
| Structural differences | Not duplicates: different contexts or qualifiers |

Temporal logic — if a new relationship contradicts an existing one, the earlier relationship is marked with an invalid_at field (when it stopped being valid) and, where applicable, expired_at. This preserves the graph's temporal history instead of simply deleting obsolete facts.
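
To make the temporal logic concrete, here is a hedged sketch of how the contradicted_facts indices might be applied. The Fact dataclass and the function are hypothetical. Setting invalid_at to the new fact's valid_at and expired_at to the processing time is an assumption about the semantics, not documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    fact: str
    valid_at: Optional[datetime] = None
    invalid_at: Optional[datetime] = None   # when the fact stopped being true
    expired_at: Optional[datetime] = None   # when the system marked it obsolete


def apply_contradictions(new_fact: Fact, candidates: list[Fact],
                         contradicted_facts: list[int], now: datetime) -> list[Fact]:
    """Mark contradicted candidates as invalid instead of deleting them."""
    invalidated = []
    for idx in contradicted_facts:
        if not 0 <= idx < len(candidates):
            continue  # the LLM may return an out-of-range index; ignore it
        old = candidates[idx]
        if old.invalid_at is not None:
            continue  # already invalidated by an earlier fact
        old.invalid_at = new_fact.valid_at or now
        old.expired_at = now
        invalidated.append(old)
    return invalidated


old = Fact("Alice works at Acme", valid_at=datetime(2020, 1, 1, tzinfo=timezone.utc))
new = Fact("Alice works at Globex", valid_at=datetime(2024, 6, 1, tzinfo=timezone.utc))
now = datetime(2025, 1, 1, tzinfo=timezone.utc)

changed = apply_contradictions(new, [old], contradicted_facts=[0, 7], now=now)
print(old.invalid_at, old.expired_at, len(changed))
```

The old fact stays in the graph with its validity window closed, so historical queries can still see it.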

Edge-type validation

When the schema defines allowed relationship types (via edge_type_map), Aletheia applies type-signature validation:

  1. For each extracted edge, all valid (source_type, target_type) label pairs are computed.
  2. The allowed relationship types for each pair are looked up in the type map.
  3. If the type the LLM extracted is not in the valid list for that signature, it is converted to RELATES_TO (the generic default type).
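
The three steps above can be sketched as follows. The shape of edge_type_map (a dictionary from (source_type, target_type) pairs to lists of allowed relationship types) and the function name are assumptions made for illustration.

```python
from itertools import product

DEFAULT_EDGE_TYPE = "RELATES_TO"


def validate_edge_type(extracted_type: str,
                       source_labels: list[str],
                       target_labels: list[str],
                       edge_type_map: dict[tuple[str, str], list[str]]) -> str:
    """Keep the extracted type only if some label pair allows it."""
    allowed: set[str] = set()
    # Steps 1 and 2: every (source_type, target_type) pair and its allowed types.
    for signature in product(source_labels, target_labels):
        allowed.update(edge_type_map.get(signature, []))
    # Step 3: fall back to the generic type when the extracted one is not allowed.
    return extracted_type if extracted_type in allowed else DEFAULT_EDGE_TYPE


edge_type_map = {("Aircraft", "Operator"): ["OPERATED_BY"]}
source, target = ["Entity", "Aircraft"], ["Entity", "Operator"]

print(validate_edge_type("OPERATED_BY", source, target, edge_type_map))   # OPERATED_BY
print(validate_edge_type("HAS_OPERATOR", source, target, edge_type_map))  # RELATES_TO
```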

Additionally, when relationship types are provided in the schema (FACT_TYPES), the extraction prompt explicitly requires the LLM to use only the defined types: "must be one of the fact_type_name values, do NOT invent." The post-validation acts as a second barrier, converting any unrecognized type to RELATES_TO. This double mechanism (prompt constraint + code validation) is necessary because the LLM is non-deterministic and can ignore instructions.

The identifier_name flag: formal identifiers in the ontology

The challenge of formal identifiers

Fuzzy and LLM deduplication work well for entities with semantic names ("Aeropuerto de Madrid-Barajas", "Ministry of Defence"). But they pose a systematic risk for entities identified by formal alphanumeric codes:

  • Police report codes: "20260000100001" vs "20260000100002" — differ by a single digit, but are completely different reports.
  • Aircraft registrations: "EC-ILR" vs "EC-ILS" — one character of difference, different aircraft.
  • Officer badge numbers: "86000" vs "86001" — different police officers.

Trigram MinHash computes high similarity between these pairs. The LLM, without extra context, may merge them. And a Levenshtein distance of 1–2 is within the auto-merge threshold. All three deduplication mechanisms — fuzzy, LLM and edit-distance guard — are prone to false positives in this scenario.

The solution: aletheia:identifierName

Aletheia introduces an OWL annotation in domain ontologies that marks entity types whose names are formal identifiers:

@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix aletheia: <http://aletheia.ai/ontology#> .
# pp: and eccairs: are domain-ontology prefixes, assumed to be declared elsewhere

aletheia:identifierName a owl:AnnotationProperty ;
    rdfs:comment "When true, the entity name is a formal identifier. Dedup uses exact name match only." .

pp:ParteDeIntervencion  aletheia:identifierName true .
pp:Agente               aletheia:identifierName true .
eccairs:Occurrence      aletheia:identifierName true .
eccairs:Aircraft        aletheia:identifierName true .

This annotation is a semantic signal from the ontology author to the deduplication engine: "the names of this entity type are formal identifiers; treat deduplication as an exact-match problem, not a semantic-similarity one."

Propagation flow

The annotation flows through the whole pipeline, from the RDF ontology to the moment of deduplication:

Ontology TTL (aletheia:identifierName true)
   |
   v  Generic Loader (rdflib: _is_identifier_name() reads the RDF triple)
   v  OntologyConcept (field identifier_name: bool = False)
   v  EntityTypeDefinition (field identifier_name: bool = False)
   v  Pydantic code generation:
        class ParteDeIntervencion(CoerciveBaseModel):
            """Police intervention report identified by a numeric code."""
            __identifier_name__: ClassVar[bool] = True
   v  Deduplication pipeline: _is_identifier_name_type() inspects the class attribute
   v  _resolve_exact_only() — normalized exact match only

Behavior in deduplication

When _is_identifier_name_type() returns True for an entity, it is excluded entirely from passes 2 (fuzzy) and 3 (LLM). Only pass 1 (normalized exact match) applies. This means:

  • "20260000100001" and "20260000100002" never merge (their normalized names differ).
  • "20260000100001" and "20260000100001" always merge (exact match).
  • "20260000100001" and " 20260000100001 " merge (normalization collapses whitespace).

The partition is binary and decided at schema-load time, not at runtime. There is no heuristic and no ambiguity: either the type is marked identifier_name and uses exact matching, or it isn't and goes through the full pipeline.
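
A minimal sketch of this binary routing, using plain Python classes in place of the generated Pydantic models. EntityModel, Organization and passes_for are hypothetical names introduced for the example; only the __identifier_name__ class attribute comes from the flow shown earlier.

```python
from typing import ClassVar


class EntityModel:
    """Stand-in for the generated base model (hypothetical)."""
    __identifier_name__: ClassVar[bool] = False


class ParteDeIntervencion(EntityModel):
    """Police intervention report identified by a numeric code."""
    __identifier_name__: ClassVar[bool] = True


class Organization(EntityModel):
    """Organization with a natural-language name."""


def is_identifier_name_type(entity_type: type) -> bool:
    """Inspect the class attribute set at schema-load time."""
    return bool(getattr(entity_type, "__identifier_name__", False))


def passes_for(entity_type: type) -> tuple[str, ...]:
    """Identifier types get exact matching only; everything else gets all passes."""
    if is_identifier_name_type(entity_type):
        return ("exact",)
    return ("exact", "fuzzy", "llm")


print(passes_for(ParteDeIntervencion))  # ('exact',)
print(passes_for(Organization))         # ('exact', 'fuzzy', 'llm')
```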

Challenges and lessons learned

Deduplication collision on alphanumeric codes

Impact: 72 % of the errors in the first version of the policia_partes use case.

Symptom: Role entities like "Identificado - 20260000100007" and "Identificado - 20260000100008" merged into a single node, collapsing distinct people into one entity.

Root cause: The names shared an identical prefix ("Identificado - ") followed by a long numeric suffix. A name's vector embedding has low sensitivity to differences in the last digits of a long numeric sequence. Trigram Jaccard similarity was above 95 %, and Levenshtein distance was 1. Every deduplication mechanism conspired to produce a false positive.

Dual solution:

  1. Redesign the naming scheme — moved from type+code names to natural-language names that include the person's name:
    Before:  "Identificado - 20260000100007"                     (fragile to fuzzy similarity)
    After:   "Identification of KHADIJA DAOUD in 20260000100007"  (robust)
    
    Including the person's name makes the embeddings semantically distinct.
  2. identifier_name flag — marked the Identificacion type as identifierName true, forcing exact match. Even if the embeddings were similar, deduplication would never merge literally different names.

Result: from 44 wrong edges due to person collisions to 0; from 10 incorrectly merged nodes to 0; recall from 87.6 % to 99.3 %.

Lesson: the right fix was not to tune thresholds or add prompt rules. It was to redesign the naming scheme so the representation was inherently distinguishable, and to signal in the ontology that deduplication must be exact. Good data modeling eliminates whole categories of error.

Short names and the entropy gate

Symptom: "Spain" and "SPAIN" were not deduplicated; they appeared as two distinct nodes.

Root cause: "Spain" has 5 characters and a single token, so it fails the entropy gate's length and token-count checks (its character-level Shannon entropy, log2(5) ≈ 2.32 bits, is above the 1.5-bit threshold, so entropy itself was not the trigger). The gate, designed to exclude short names from fuzzy matching (where they're unreliable), was also blocking exact matching. The name escalated straight to the LLM, which sometimes failed to recognize the trivial duplicate.

Solution: exact match always runs first, before any entropy or length filter. No matter how short or simple a name is, if a normalized literal match exists, it is used. The entropy gate only affects the fuzzy (MinHash) pass.

Lesson: quality filters should only apply to the mechanism that needs them. A filter designed to protect MinHash was blocking a completely different mechanism (exact match) that didn't need it.

Context fragmentation in long episodes

Symptom: When ingesting long documents, entities declared in one section with a specific name appeared with slightly different names in later sections.

Root cause: Long episodes are split into independent chunks for LLM extraction, to respect context limits. But each chunk loses the naming context established in earlier chunks. Not seeing the canonical name declared in a previous chunk, the LLM invents variants.

Solution: Aletheia disables automatic chunking for text-type episodes (markdown, prose). These are processed as a single unit, preserving name coherence. Chunking stays active for JSON-type episodes and conversational messages, where the internal structure provides natural boundaries.

Lesson: chunking is a trade-off between scalability and coherence. For narrative text, name coherence is more valuable than the ability to process arbitrarily long documents.

Symmetric context in LLM deduplication

Symptom: The LLM incorrectly merged entities of different types because it lacked enough information to tell them apart.

Root cause: The deduplication prompt provided the entity type description for newly extracted entities, but not for the entities already in the graph. Without it, the LLM couldn't assess whether two similarly named entities belonged to compatible types.

Solution: the prompt now injects the type description (the Pydantic model's __doc__) for both new and existing entities. This symmetry lets the LLM compare the semantic definitions on both sides before deciding.

Lesson: when a decision is delegated to the LLM, the quality of the decision depends directly on the quality and completeness of the context provided. Asymmetric context produces biased decisions.

Non-determinism in relationship types

Symptom: Across ingestion runs over the same data, the LLM generated different relationship types for the same semantics: HAS_OPERATOR in one run, OPERATED_BY in the next. This fragmented queries by relationship type.

Root cause: When the prompt lets the LLM invent relationship types freely (in SCREAMING_SNAKE_CASE), the LLM's intrinsic non-determinism produces variations. The same semantic relationship can be named in multiple equally valid ways.

Solution: a double constraint. First, the extraction prompt requires the LLM to use only the types defined in the schema: "must be one of the fact_type_name values, do NOT invent." Second, a post-validation converts any unrecognized type to RELATES_TO. The first barrier is an instruction to the LLM; the second is a deterministic guarantee in code.

Lesson: instructions to the LLM are suggestions, not guarantees. Any critical constraint needs a second deterministic barrier in code.

Race condition in parallel ingestion

Symptom: When ingesting multiple episodes in parallel with several workers, the same concept could be extracted in two workers simultaneously. Both queried the graph, found nothing, and each created a new node, leaving persisted duplicates that slipped past the deduplication pipeline.

Solution: EntityLockManager, a per-entity-name lock manager that serializes the deduplication phase for entities with the same normalized name. Workers can extract entities in parallel (the expensive LLM phase), but deduplication and persistence are serialized per entity. The lock is granular: only entities with the same normalized name serialize against each other; entities with different names proceed without contention.

Lesson: distributed deduplication is a consensus problem. Without explicit coordination, parallelism introduces a race window between the read ("does this entity exist?") and the write ("I create this entity"). A per-name lock is the minimum compromise between parallelism and correctness.
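
The idea of a per-name lock can be illustrated with asyncio. This is a simplified sketch under stated assumptions: NameLockManager and resolve_or_create are hypothetical names, an in-memory dictionary stands in for the graph, and the real EntityLockManager may use a different concurrency model.

```python
import asyncio
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace, as in exact normalization."""
    return " ".join(name.lower().split())


class NameLockManager:
    """Hands out one asyncio.Lock per normalized entity name."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def lock_for(self, name: str) -> asyncio.Lock:
        return self._locks[normalize(name)]


async def resolve_or_create(name: str, graph: dict[str, str],
                            locks: NameLockManager) -> str:
    """Read-then-write under the name's lock, closing the race window."""
    key = normalize(name)
    async with locks.lock_for(name):
        if key in graph:               # read: does this entity exist?
            return graph[key]
        await asyncio.sleep(0)         # simulated I/O between read and write
        graph[key] = f"node-{len(graph) + 1}"  # write: create the entity
        return graph[key]


async def main() -> tuple[dict[str, str], list[str]]:
    graph: dict[str, str] = {}
    locks = NameLockManager()
    names = ["Spain", "SPAIN", " spain ", "Madrid"]
    ids = await asyncio.gather(*(resolve_or_create(n, graph, locks) for n in names))
    return graph, list(ids)


graph, ids = asyncio.run(main())
print(graph)  # two nodes: one for "spain", one for "madrid"
print(ids)
```

The three spellings of "Spain" queue on the same lock and end up sharing one node, while "Madrid" proceeds without waiting for them.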

Technique summary

| Technique | Level | Cost | When it applies |
| --- | --- | --- | --- |
| Normalized exact match | Nodes & edges | O(1) lookup | Always, first pass |
| MinHash + LSH | Nodes | O(n) signatures | Names with entropy ≥ 1.5 bits, ≥ 6 chars, ≥ 2 tokens |
| Jaccard + Levenshtein | Nodes | O(k) candidates | LSH candidates |
| LLM escalation | Nodes | 1 LLM call | Nodes unresolved by earlier passes |
| Composite key (source, target, fact) | Edges | O(1) lookup | Intra-episode & inter-chunk dedup |
| Embedding + cosine similarity | Edges | O(n) search | Duplicate/contradiction candidate search |
| LLM with dual lists | Edges | 1 LLM call | Duplicate & contradiction detection |
| identifier_name flag | Nodes | O(1) lookup | Entity types with formal identifiers |
| Type-signature validation | Edges | O(k) lookup | When the schema defines relationship types |
| Union-Find with compression | Batches | O(n) amortized | Intra-batch dedup in mass ingestion |
| EntityLockManager | Concurrency | O(1) lock | Parallel multi-worker ingestion |

Conclusions

Deduplication in an LLM-built knowledge graph is a multi-dimensional problem that no single technique solves. Aletheia's approach combines:

  1. Fast deterministic methods (normalization, exact match, MinHash) to resolve most cases with no LLM cost.
  2. Intelligent LLM escalation when deterministic methods aren't enough, with enriched prompts that include symmetric type context for both sides of the comparison.
  3. Ontological signaling (identifier_name) so the ontology author can declare that certain entity types have formal names that must not be subjected to fuzzy logic.
  4. Type constraints in relationship extraction with a double barrier (prompt + deterministic validation) to prevent the proliferation of non-deterministic types.
  5. Concurrency coordination via granular per-entity locks to prevent duplicates in parallel ingestion.

The most important lesson is that deduplication is not just an algorithmic problem — it is a data-design problem. A well-designed naming scheme (natural-language names, formal identifiers signaled in the ontology) drastically reduces the load on the deduplication pipeline. When the pipeline fails consistently, the right answer is to improve the data modeling, not to add more rules to the LLM prompt.