Deduplication & Normalization
Why deduplication is the central problem
When a knowledge-graph engine ingests real-world data, the same concept shows up in many forms. "España", "ESPAÑA", "Spain", "Reino de España" and "ES" can all refer to the same entity. If the system creates one node per variant, the graph fragments: queries return partial results, relationships scatter across duplicates, and knowledge quality degrades with every episode ingested.
Aletheia delegates entity and relationship extraction to an LLM. That gives it strong semantic understanding, but introduces a fundamental problem: the LLM is non-deterministic. The same entity can be extracted with slightly different names across episodes, and without a robust deduplication pipeline the graph accumulates duplicates at scale.
This page documents how Aletheia approaches deduplication and normalization — the challenges it solves and the techniques it uses.
Pipeline architecture
The deduplication pipeline operates at two independent levels: nodes (entities) and edges (relationships/facts). Both follow a multi-pass strategy that combines deterministic methods with escalation to the LLM.
Ingestion flow
When an episode is ingested via add_episode(), the sequence is:
- Entity extraction (LLM)
- Entity deduplication (deterministic + LLM)
- Relationship extraction (LLM, using the already-deduplicated entities)
- Relationship deduplication (deterministic + LLM)
- Attribute extraction (LLM)
- Persistence to FalkorDB
The order is critical: entities are deduplicated before relationships are extracted. This guarantees that relationships reference canonical nodes, not transient duplicates.
Entity (node) deduplication
Name normalization
Every name comparison starts with normalization. Aletheia uses two normalization functions with different levels of aggressiveness:
Exact normalization (`_normalize_string_exact`): lowercases and collapses runs of whitespace. Used for the first exact-match pass and for fact-deduplication keys on edges.
import re
# "JOHN SMITH" -> "john smith"
# " Madrid " -> "madrid"
normalized = re.sub(r'\s+', ' ', name.lower()).strip()
Fuzzy normalization (`_normalize_name_for_fuzzy`): starts from the exact-normalized name and replaces every character other than lowercase letters, digits, apostrophes, and spaces with a space, then collapses the whitespace this leaves behind. Used to generate shingles (n-grams) for MinHash fuzzy matching.
# "John-Smith III" -> "john smith iii"
# "O'Brien (Jr.)" -> "o'brien jr"
cleaned = re.sub(r"[^a-z0-9' ]", ' ', exact_normalized)
# Collapse the double spaces left where punctuation was removed
normalized = re.sub(r'\s+', ' ', cleaned).strip()
Three-pass strategy
Node deduplication applies three methods in sequence, from cheapest to most expensive.
Pass 1 — Exact match. After normalizing the name (lowercase + whitespace collapse), Aletheia looks for a literal match against existing entities. If "madrid" already exists and the extracted entity normalizes to "madrid", they merge with no further analysis.
A key design point: exact match always runs first, before any entropy or length filter. This guarantees that short names like "Spain"/"SPAIN" resolve immediately by literal match, with no need to escalate to the LLM.
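As a rough illustration of pass 1, the sketch below keeps a dictionary from normalized name to canonical node id. The `ExactIndex` class and its `resolve` method are hypothetical names introduced here, not Aletheia's API; in the pipeline described above, a miss would move on to pass 2 rather than register a new node immediately.

```python
import re


def normalize_exact(name: str) -> str:
    """Lowercase and collapse whitespace, mirroring the exact normalization."""
    return re.sub(r"\s+", " ", name.lower()).strip()


class ExactIndex:
    """Hypothetical in-memory index: normalized name -> canonical node id."""

    def __init__(self) -> None:
        self._by_name: dict[str, str] = {}

    def resolve(self, name: str, new_id: str) -> str:
        """Return the existing id on a literal match, otherwise register new_id."""
        key = normalize_exact(name)
        return self._by_name.setdefault(key, new_id)


index = ExactIndex()
first = index.resolve("Spain", "node-1")
second = index.resolve("  SPAIN ", "node-2")  # same normalized key, reuses node-1
third = index.resolve("España", "node-3")     # different literal, stays separate
```

Note that "España" does not merge with "Spain" here: exact matching only removes case and whitespace differences, which is why the later passes exist.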
Pass 2 — Fuzzy similarity (MinHash + LSH). For names that don't match exactly, Aletheia uses MinHash-based Locality-Sensitive Hashing:
- Shingle generation: the normalized name is broken into character trigrams. "aeropuerto madrid barajas" produces {"aer", "ero", "rop", "opu", "pue", …}.
- MinHash signature: 32 hash permutations over the shingles produce a compact signature of the name.
- LSH bands: the signature is split into bands of 4 elements. If two names share at least one identical band, they are duplicate candidates.
- Validation: for each candidate, the exact Jaccard similarity between shingle sets is computed. A 95 % threshold is required (`_FUZZY_JACCARD_THRESHOLD = 0.95`).
- Edit-distance guard: even above the Jaccard threshold, the Levenshtein distance must be at most 2 characters. This prevents false merges between long alphanumeric identifiers that differ by a few digits (e.g., "Report-001" vs "Report-002").
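The five steps above can be sketched end to end as follows. This is a minimal illustration, not Aletheia's code: the seeded BLAKE2b hashes standing in for "permutations", the choice to keep spaces inside shingles, and all function names are assumptions made for the example. Only the numeric constants come from this page.

```python
import hashlib
import re

NUM_PERM = 32        # signature size (_MINHASH_PERMUTATIONS)
BAND_SIZE = 4        # LSH band width (_MINHASH_BAND_SIZE)
JACCARD_MIN = 0.95   # _FUZZY_JACCARD_THRESHOLD
MAX_EDITS = 2        # _MAX_EDIT_DISTANCE_FOR_AUTO_MERGE


def normalize_fuzzy(name: str) -> str:
    lowered = re.sub(r"\s+", " ", name.lower()).strip()
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9' ]", " ", lowered)).strip()


def shingles(name: str, n: int = 3) -> set[str]:
    text = normalize_fuzzy(name)
    return {text[i:i + n] for i in range(len(text) - n + 1)} or {text}


def minhash(sh: set[str]) -> tuple[int, ...]:
    # One salted hash per "permutation"; keep the minimum over all shingles.
    def h(seed: int, s: str) -> int:
        salt = seed.to_bytes(8, "little")
        digest = hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little")
    return tuple(min(h(seed, s) for s in sh) for seed in range(NUM_PERM))


def bands(sig: tuple[int, ...]) -> set[tuple[int, tuple[int, ...]]]:
    # Tag each band with its position so bands only match at the same offset.
    return {(i, sig[i:i + BAND_SIZE]) for i in range(0, len(sig), BAND_SIZE)}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def is_fuzzy_duplicate(a: str, b: str) -> bool:
    sa, sb = shingles(a), shingles(b)
    if not bands(minhash(sa)) & bands(minhash(sb)):
        return False  # no shared band: not even an LSH candidate
    if jaccard(sa, sb) < JACCARD_MIN:
        return False  # candidate rejected by exact Jaccard validation
    return levenshtein(normalize_fuzzy(a), normalize_fuzzy(b)) <= MAX_EDITS
```

In a real index the band tuples would be used as bucket keys so that candidates are found by lookup rather than by comparing every pair; the pairwise function here only shows the decision logic.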
Entropy gate — before the fuzzy pass, a filter excludes names with Shannon entropy below 1.5 bits, fewer than 6 characters, or fewer than 2 tokens. Short or repetitive names (e.g., "A", "USA") have unreliable MinHash signatures and are escalated straight to the LLM.
System constants:
| Parameter | Value | Purpose |
|---|---|---|
| `_NAME_ENTROPY_THRESHOLD` | 1.5 bits | Minimum entropy for fuzzy matching |
| `_MIN_NAME_LENGTH` | 6 chars | Minimum length for fuzzy matching |
| `_MIN_TOKEN_COUNT` | 2 tokens | Minimum tokens for fuzzy matching |
| `_FUZZY_JACCARD_THRESHOLD` | 0.95 | Minimum n-gram similarity |
| `_MAX_EDIT_DISTANCE_FOR_AUTO_MERGE` | 2 | Maximum edit distance |
| `_MINHASH_PERMUTATIONS` | 32 | MinHash signature size |
| `_MINHASH_BAND_SIZE` | 4 | LSH band width |
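Using the first three constants in the table, the entropy gate might look like the sketch below. Two things are assumptions of this example rather than facts from the source: entropy is computed per character with spaces ignored, and the three conditions are combined exactly as the prose states (a name failing any one of them skips the fuzzy pass).

```python
import math
import re
from collections import Counter

NAME_ENTROPY_THRESHOLD = 1.5  # bits
MIN_NAME_LENGTH = 6           # characters
MIN_TOKEN_COUNT = 2           # whitespace-separated tokens


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (spaces ignored, an assumption)."""
    chars = text.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def passes_entropy_gate(name: str) -> bool:
    """True if the name is eligible for the MinHash pass."""
    normalized = re.sub(r"\s+", " ", name.lower()).strip()
    if len(normalized) < MIN_NAME_LENGTH:
        return False
    if len(normalized.split()) < MIN_TOKEN_COUNT:
        return False
    return shannon_entropy(normalized) >= NAME_ENTROPY_THRESHOLD
```

Names that fail the gate are not discarded: as described above, they skip pass 2 and go to the LLM, after exact matching has already had its chance.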
Pass 3 — LLM escalation. Entities resolved by neither exact nor fuzzy matching are presented to the LLM for a semantic decision. The prompt includes conversational context (previous episodes and current message), the extracted entity (name, type, type description), and the existing entities (name, type, type description, attributes). The key instruction:
"Entities should only be considered duplicates if they refer to the same real-world object or concept. Use the entity type description to understand which properties or identifiers define each type. If two entities of the same type share a unique identifier described in the type definition (e.g., a code, registration number or ID), they probably refer to the same entity even if their names differ."
Importantly, the prompt provides the entity type description for both the extracted and the existing entities. This symmetric context lets the LLM compare the type definitions on both sides before deciding.
As an extra robustness mechanism, when the LLM returns a duplicate name that doesn't exactly match any existing entity (e.g., it returns "Barajas" when the existing entity is "Aeropuerto Madrid-Barajas"), Aletheia applies a containment fallback: if the returned name is contained in an existing entity's name, the match is accepted.
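A containment fallback of this kind could be sketched as below. The function name and the tie-breaking rule are hypothetical: the source only says that a contained name is accepted, so returning `None` when several existing names contain the returned one is a conservative choice made for this example.

```python
from typing import Optional


def _norm(name: str) -> str:
    return " ".join(name.lower().split())


def match_returned_name(returned: str, existing_names: list[str]) -> Optional[str]:
    """Map the name returned by the LLM onto an existing entity name."""
    target = _norm(returned)
    if not target:
        return None
    # Preferred path: the LLM echoed an existing name exactly.
    for name in existing_names:
        if _norm(name) == target:
            return name
    # Fallback: the returned name is a substring of an existing name.
    contained = [n for n in existing_names if target in _norm(n)]
    # Assumption of this sketch: refuse to guess when the match is ambiguous.
    return contained[0] if len(contained) == 1 else None


existing = ["Aeropuerto Madrid-Barajas", "Puerto de Valencia"]
resolved = match_returned_name("Barajas", existing)
```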
Bulk resolution
For mass ingestion (add_episode_bulk()), Aletheia adds an extra intra-batch deduplication layer:
- Each episode is resolved in parallel against the existing graph.
- The resolved nodes from all episodes are compared against each other using deterministic similarity heuristics.
- A UUID map is built as a Union-Find structure with path compression, collapsing transitive chains (if A=B and B=C, then A=C).
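The transitive collapse in the last step can be illustrated with a small Union-Find. The class and method names are illustrative only; the point is that after recording A=B and B=C, every UUID maps directly to one canonical root.

```python
class UuidUnionFind:
    """Minimal Union-Find over UUID strings with path compression."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def find(self, uuid: str) -> str:
        self._parent.setdefault(uuid, uuid)
        # First walk up to the root.
        root = uuid
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path straight at the root.
        while self._parent[uuid] != root:
            next_uuid = self._parent[uuid]
            self._parent[uuid] = root
            uuid = next_uuid
        return root

    def union(self, duplicate: str, canonical: str) -> None:
        root_dup, root_canon = self.find(duplicate), self.find(canonical)
        if root_dup != root_canon:
            self._parent[root_dup] = root_canon

    def as_map(self) -> dict[str, str]:
        """Flatten to a plain duplicate -> canonical UUID map."""
        return {u: self.find(u) for u in list(self._parent)}


uf = UuidUnionFind()
uf.union("A", "B")  # A is a duplicate of B
uf.union("B", "C")  # B is a duplicate of C
uuid_map = uf.as_map()
```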
EntityLockManager — for concurrent standard-quality ingestion, Aletheia implements a per-entity lock manager that coordinates deduplication across parallel workers. Without it, two workers could create duplicates of the same node simultaneously before either sees the other's node.
Relationship (edge) deduplication
Intra-episode
Before any semantic analysis, duplicate relationships within the same episode are removed via a composite key of (source, target, normalized fact). If two extracted relationships share all three components, only the first is kept.
Across extraction chunks
When an episode contains many entities, relationship extraction is split into chunks to respect the LLM's context limits (covering chunks guarantee every entity pair is processed). Facts extracted from overlapping chunks are deduplicated with the same composite key:
# Composite keys already seen across all chunks
seen_facts: set[tuple[str, str, str]] = set()
for chunk_edges in chunk_results:
    for edge in chunk_edges:
        key = (edge.source_entity_name, edge.target_entity_name,
               _normalize_string_exact(edge.fact))
        # Keep only the first edge for each (source, target, fact) key
        if key not in seen_facts:
            seen_facts.add(key)
            edges_data.append(edge)
Against the existing graph (LLM)
For each extracted relationship, Aletheia finds existing relationships between the same nodes and presents them to the LLM in two separate lists:
- EXISTING FACTS — facts with the same source/target nodes (exact-duplicate candidates)
- FACT INVALIDATION CANDIDATES — semantically related facts (contradiction candidates)
The LLM responds with duplicate_facts (indices of semantically identical existing facts) and contradicted_facts (indices of facts the new one contradicts or invalidates). This two-list separation is deliberate: it lets the LLM distinguish between "this fact already exists" and "this fact makes an earlier fact obsolete" — semantically very different decisions.
Equivalence rules in the prompt:
| Pattern | Treatment |
|---|---|
| Active/passive voice | Equivalent: "A owns B" = "B is property of A" |
| Numeric format | Equivalent: "$6 billion" = "$6,000,000,000" = "6B USD" |
| Known aliases | Equivalent: when "al-Shabaab" and "Harakat al-Shabaab al-Mujahideen" are known to be the same entity |
| Numeric differences | Not duplicates: different numeric values imply different facts |
| Structural differences | Not duplicates: different contexts or qualifiers |
Temporal logic — if a new relationship contradicts an existing one, the earlier relationship is marked with an invalid_at field (when it stopped being valid) and, where applicable, expired_at. This preserves the graph's temporal history instead of simply deleting obsolete facts.
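To make the dual-list response and the temporal logic concrete, here is a hedged sketch of how a resolver might apply the LLM's answer. The `Fact` dataclass, its `valid_at` field, and the rule that `invalid_at` takes the new fact's start time are assumptions for illustration. The source only states that contradicted facts receive `invalid_at` and, where applicable, `expired_at`. Indices are assumed to point into their respective lists.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    text: str
    valid_at: datetime                     # hypothetical field
    invalid_at: Optional[datetime] = None  # when the fact stopped being true
    expired_at: Optional[datetime] = None  # when the graph retired it


def apply_resolution(new_fact: Fact, existing: list[Fact], candidates: list[Fact],
                     duplicate_facts: list[int], contradicted_facts: list[int],
                     now: datetime) -> Fact:
    """Apply the LLM's duplicate/contradiction indices; return the fact to keep."""
    # Contradicted facts stay in the graph but get an end of validity.
    for i in contradicted_facts:
        if 0 <= i < len(candidates):  # ignore out-of-range indices from the LLM
            candidates[i].invalid_at = new_fact.valid_at
            candidates[i].expired_at = now
    # A duplicate means the existing fact is reused instead of storing a new one.
    for i in duplicate_facts:
        if 0 <= i < len(existing):
            return existing[i]
    return new_fact


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old = Fact("Alice works at Acme", valid_at=t0)
new = Fact("Alice works at Globex", valid_at=t1)
kept = apply_resolution(new, existing=[], candidates=[old],
                        duplicate_facts=[], contradicted_facts=[0], now=now)
```

The bounds checks matter in practice: the indices come from the LLM, so they are treated as untrusted input.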
Edge-type validation
When the schema defines allowed relationship types (via edge_type_map), Aletheia applies type-signature validation:
- For each extracted edge, all valid `(source_type, target_type)` label pairs are computed.
- The allowed relationship types for each pair are looked up in the type map.
- If the type the LLM extracted is not in the valid list for that signature, it is converted to `RELATES_TO` (the generic default type).
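The three validation steps can be sketched as a single function. The shape of `edge_type_map` used here (a dictionary from `(source_type, target_type)` to a list of allowed relationship names) and the example type names are assumptions for illustration; only `edge_type_map` and `RELATES_TO` are named in the source.

```python
DEFAULT_EDGE_TYPE = "RELATES_TO"

EdgeTypeMap = dict[tuple[str, str], list[str]]


def validate_edge_type(extracted_type: str, source_labels: list[str],
                       target_labels: list[str], edge_type_map: EdgeTypeMap) -> str:
    """Return extracted_type if some label pair allows it, else the default."""
    allowed: set[str] = set()
    # Steps 1 and 2: enumerate label pairs and collect their allowed types.
    for source_type in source_labels:
        for target_type in target_labels:
            allowed.update(edge_type_map.get((source_type, target_type), []))
    # Step 3: anything outside the signature falls back to the generic type.
    return extracted_type if extracted_type in allowed else DEFAULT_EDGE_TYPE


# Hypothetical schema: aircraft are operated by operators.
type_map: EdgeTypeMap = {("Aircraft", "Operator"): ["OPERATED_BY"]}
ok = validate_edge_type("OPERATED_BY", ["Entity", "Aircraft"],
                        ["Entity", "Operator"], type_map)
fallback = validate_edge_type("HAS_OPERATOR", ["Entity", "Aircraft"],
                              ["Entity", "Operator"], type_map)
```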
Additionally, when relationship types are provided in the schema (FACT_TYPES), the extraction prompt explicitly requires the LLM to use only the defined types: "must be one of the fact_type_name values, do NOT invent." The post-validation acts as a second barrier, converting any unrecognized type to RELATES_TO. This double mechanism (prompt constraint + code validation) is necessary because the LLM is non-deterministic and can ignore instructions.
The identifier_name flag: formal identifiers in the ontology
The challenge of formal identifiers
Fuzzy and LLM deduplication work well for entities with semantic names ("Aeropuerto de Madrid-Barajas", "Ministry of Defence"). But they pose a systematic risk for entities identified by formal alphanumeric codes:
- Police report codes: "20260000100001" vs "20260000100002" — differ by a single digit, but are completely different reports.
- Aircraft registrations: "EC-ILR" vs "EC-ILS" — one character of difference, different aircraft.
- Officer badge numbers: "86000" vs "86001" — different police officers.
Trigram MinHash computes high similarity between these pairs. The LLM, without extra context, may merge them. And a Levenshtein distance of 1–2 is within the auto-merge threshold. All three deduplication mechanisms — fuzzy, LLM and edit-distance guard — are prone to false positives in this scenario.
The solution: aletheia:identifierName
Aletheia introduces an OWL annotation in domain ontologies that marks entity types whose names are formal identifiers:
@prefix aletheia: <http://aletheia.ai/ontology#> .
aletheia:identifierName a owl:AnnotationProperty ;
rdfs:comment "When true, the entity name is a formal identifier. Dedup uses exact name match only." .
pp:ParteDeIntervencion aletheia:identifierName true .
pp:Agente aletheia:identifierName true .
eccairs:Occurrence aletheia:identifierName true .
eccairs:Aircraft aletheia:identifierName true .
This annotation is a semantic signal from the ontology author to the deduplication engine: "the names of this entity type are formal identifiers; treat deduplication as an exact-match problem, not a semantic-similarity one."
Propagation flow
The annotation flows through the whole pipeline, from the RDF ontology to the moment of deduplication:
- Ontology TTL: the type is annotated with `aletheia:identifierName true`.
- Generic Loader: `_is_identifier_name()` reads the RDF triple via rdflib.
- `OntologyConcept`: carries the field `identifier_name: bool = False`.
- `EntityTypeDefinition`: carries the field `identifier_name: bool = False`.
- Pydantic code generation: the generated class, for example `ParteDeIntervencion(CoerciveBaseModel)` with the docstring "Police intervention report identified by a numeric code.", declares `__identifier_name__: ClassVar[bool] = True`.
- Deduplication pipeline: `_is_identifier_name_type()` inspects the class attribute.
- `_resolve_exact_only()`: applies normalized exact match only.
Behavior in deduplication
When _is_identifier_name_type() returns True for an entity, it is excluded entirely from passes 2 (fuzzy) and 3 (LLM). Only pass 1 (normalized exact match) applies. This means:
- "20260000100001" and "20260000100002" never merge (their normalized names differ).
- "20260000100001" and "20260000100001" always merge (exact match).
- "20260000100001" and " 20260000100001 " merge (normalization collapses whitespace).
The partition is binary and decided at schema-load time, not at runtime. There is no heuristic and no ambiguity: either the type is marked identifier_name and uses exact matching, or it isn't and goes through the full pipeline.
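The routing described above might be sketched as follows. Plain classes stand in for the generated Pydantic models, and the function bodies are illustrative guesses at what `_is_identifier_name_type()` and `_resolve_exact_only()` do, not their actual implementations. The `Organizacion` type and the `full_pipeline` callback are hypothetical.

```python
import re
from typing import Callable, ClassVar, Optional


class ParteDeIntervencion:
    """Police intervention report identified by a numeric code."""
    __identifier_name__: ClassVar[bool] = True


class Organizacion:
    """Hypothetical type without the flag; goes through the full pipeline."""


def normalize_exact(name: str) -> str:
    return re.sub(r"\s+", " ", name.lower()).strip()


def is_identifier_name_type(entity_type: Optional[type]) -> bool:
    return bool(getattr(entity_type, "__identifier_name__", False))


def resolve_exact_only(name: str, existing: dict[str, str]) -> Optional[str]:
    """Pass 1 only: look up the normalized name, never fuzzy or LLM."""
    return existing.get(normalize_exact(name))


def resolve(name: str, entity_type: Optional[type], existing: dict[str, str],
            full_pipeline: Callable[[str, dict[str, str]], Optional[str]]) -> Optional[str]:
    if is_identifier_name_type(entity_type):
        return resolve_exact_only(name, existing)
    return full_pipeline(name, existing)


existing_reports = {"20260000100001": "uuid-1"}  # normalized name -> node UUID
never_called = lambda name, existing: "uuid-from-fuzzy-or-llm"
same = resolve(" 20260000100001 ", ParteDeIntervencion, existing_reports, never_called)
other = resolve("20260000100002", ParteDeIntervencion, existing_reports, never_called)
```

A `None` result means "create a new node"; for flagged types there is no path by which a near-miss code can be merged into an existing report.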
Challenges and lessons learned
Deduplication collision on alphanumeric codes
Impact: 72 % of the errors in the first version of the policia_partes use case.
Symptom: Role entities like "Identificado - 20260000100007" and "Identificado - 20260000100008" merged into a single node, collapsing distinct people into one entity.
Root cause: The names shared an identical prefix ("Identificado - ") followed by a long numeric suffix. A name's vector embedding has low sensitivity to differences in the last digits of a long numeric sequence. Trigram Jaccard similarity was above 95 %, and Levenshtein distance was 1. Every deduplication mechanism conspired to produce a false positive.
Dual solution:
- Redesign the naming scheme: moved from type+code names to natural-language names that include the person's name, which makes the embeddings semantically distinct.
- `identifier_name` flag: marked the `Identificacion` type as `identifierName true`, forcing exact match. Even if the embeddings were similar, deduplication would never merge literally different names.
Result: from 44 wrong edges due to person collisions to 0; from 10 incorrectly merged nodes to 0; recall from 87.6 % to 99.3 %.
Lesson: the right fix was not to tune thresholds or add prompt rules. It was to redesign the naming scheme so the representation was inherently distinguishable, and to signal in the ontology that deduplication must be exact. Good data modeling eliminates whole categories of error.
Short names and the entropy gate
Symptom: "Spain" and "SPAIN" were not deduplicated; they appeared as two distinct nodes.
Root cause: "Spain" has only 5 characters and a single token, which puts it below the gate's length and token minimums. The entropy gate, designed to exclude short names from fuzzy matching (where they're unreliable), was also blocking exact matching. The name escalated straight to the LLM, which sometimes failed to recognize the trivial duplicate.
Solution: exact match always runs first, before any entropy or length filter. No matter how short or simple a name is, if a normalized literal match exists, it is used. The entropy gate only affects the fuzzy (MinHash) pass.
Lesson: quality filters should only apply to the mechanism that needs them. A filter designed to protect MinHash was blocking a completely different mechanism (exact match) that didn't need it.
Context fragmentation in long episodes
Symptom: When ingesting long documents, entities declared in one section with a specific name appeared with slightly different names in later sections.
Root cause: Long episodes are split into independent chunks for LLM extraction, to respect context limits. But each chunk loses the naming context established in earlier chunks. Not seeing the canonical name declared in a previous chunk, the LLM invents variants.
Solution: Aletheia disables automatic chunking for text-type episodes (markdown, prose). These are processed as a single unit, preserving name coherence. Chunking stays active for JSON-type episodes and conversational messages, where the internal structure provides natural boundaries.
Lesson: chunking is a trade-off between scalability and coherence. For narrative text, name coherence is more valuable than the ability to process arbitrarily long documents.
Symmetric context in LLM deduplication
Symptom: The LLM incorrectly merged entities of different types because it lacked enough information to tell them apart.
Root cause: The deduplication prompt provided the entity type description for newly extracted entities, but not for the entities already in the graph. Without it, the LLM couldn't assess whether two similarly named entities belonged to compatible types.
Solution: the prompt now injects the type description (the Pydantic model's `__doc__`) for both new and existing entities. This symmetry lets the LLM compare the semantic definitions on both sides before deciding.
Lesson: when a decision is delegated to the LLM, the quality of the decision depends directly on the quality and completeness of the context provided. Asymmetric context produces biased decisions.
Non-determinism in relationship types
Symptom: Across ingestion runs over the same data, the LLM generated different relationship types for the same semantics: `HAS_OPERATOR` in one run, `OPERATED_BY` in the next. This fragmented queries by relationship type.
Root cause: When the prompt lets the LLM invent relationship types freely (in SCREAMING_SNAKE_CASE), the LLM's intrinsic non-determinism produces variations. The same semantic relationship can be named in multiple equally valid ways.
Solution: a double constraint. First, the extraction prompt requires the LLM to use only the types defined in the schema: "must be one of the fact_type_name values, do NOT invent." Second, a post-validation converts any unrecognized type to `RELATES_TO`. The first barrier is an instruction to the LLM; the second is a deterministic guarantee in code.
Lesson: instructions to the LLM are suggestions, not guarantees. Any critical constraint needs a second deterministic barrier in code.
Race condition in parallel ingestion
Symptom: When ingesting multiple episodes in parallel with several workers, the same concept could be extracted in two workers simultaneously. Both queried the graph, found nothing, and each created a new node, leaving persisted duplicates that slipped past the deduplication pipeline.
Solution: `EntityLockManager`, a per-entity-name lock manager that serializes the deduplication phase for entities with the same normalized name. Workers can extract entities in parallel (the expensive LLM phase), but deduplication and persistence are serialized per entity. The lock is granular: only entities with the same normalized name serialize against each other; entities with different names proceed without contention.
Lesson: distributed deduplication is a consensus problem. Without explicit coordination, parallelism introduces a race window between the read ("does this entity exist?") and the write ("I create this entity"). A per-name lock is the minimum compromise between parallelism and correctness.
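A per-name lock manager along these lines could be sketched as below. It uses threads and an in-memory dictionary as a stand-in for the graph; both are assumptions of this example, since the source does not say whether Aletheia's `EntityLockManager` is thread-based or async, nor how it is scoped across processes.

```python
import threading


def normalize(name: str) -> str:
    return " ".join(name.lower().split())


class EntityLockManager:
    """Sketch: hands out one lock per normalized entity name."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks: dict[str, threading.Lock] = {}

    def lock_for(self, name: str) -> threading.Lock:
        key = normalize(name)
        with self._guard:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]


graph: dict[str, str] = {}  # stand-in for the graph store
manager = EntityLockManager()


def upsert(name: str, worker: str) -> None:
    key = normalize(name)
    # Read and write happen under the same per-name lock, closing the race window.
    with manager.lock_for(name):
        if key not in graph:                     # "does this entity exist?"
            graph[key] = f"created-by-{worker}"  # "I create this entity"


workers = [threading.Thread(target=upsert, args=(n, f"w{i}"))
           for i, n in enumerate(["Madrid", "MADRID ", "Sevilla", " madrid"])]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Whichever worker wins the "madrid" lock first creates the node; the others find it already present. "Sevilla" never waits on the "madrid" lock, which is the granularity the text describes.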
Technique summary
| Technique | Level | Cost | When it applies |
|---|---|---|---|
| Normalized exact match | Nodes & edges | O(1) lookup | Always, first pass |
| MinHash + LSH | Nodes | O(n) signatures | Names with entropy ≥ 1.5 bits, ≥ 6 chars, ≥ 2 tokens |
| Jaccard + Levenshtein | Nodes | O(k) candidates | LSH candidates |
| LLM escalation | Nodes | 1 LLM call | Nodes unresolved by earlier passes |
| Composite key (source, target, fact) | Edges | O(1) lookup | Intra-episode & inter-chunk dedup |
| Embedding + cosine similarity | Edges | O(n) search | Duplicate/contradiction candidate search |
| LLM with dual lists | Edges | 1 LLM call | Duplicate & contradiction detection |
| `identifier_name` flag | Nodes | O(1) lookup | Entity types with formal identifiers |
| Type-signature validation | Edges | O(k) lookup | When the schema defines relationship types |
| Union-Find with compression | Batches | O(n) amortized | Intra-batch dedup in mass ingestion |
| `EntityLockManager` | Concurrency | O(1) lock | Parallel multi-worker ingestion |
Conclusions
Deduplication in an LLM-built knowledge graph is a multi-dimensional problem that no single technique solves. Aletheia's approach combines:
- Fast deterministic methods (normalization, exact match, MinHash) to resolve most cases with no LLM cost.
- Intelligent LLM escalation when deterministic methods aren't enough, with enriched prompts that include symmetric type context for both sides of the comparison.
- Ontological signaling (`identifier_name`) so the ontology author can declare that certain entity types have formal names that must not be subjected to fuzzy logic.
- Type constraints in relationship extraction with a double barrier (prompt + deterministic validation) to prevent the proliferation of non-deterministic types.
- Concurrency coordination via granular per-entity locks to prevent duplicates in parallel ingestion.
The most important lesson is that deduplication is not just an algorithmic problem — it is a data-design problem. A well-designed naming scheme (natural-language names, formal identifiers signaled in the ontology) drastically reduces the load on the deduplication pipeline. When the pipeline fails consistently, the right answer is to improve the data modeling, not to add more rules to the LLM prompt.