When LLMs produce structured data, they can fabricate plausible-looking facts. Entity Enricher uses 8 defense layers to ensure you get accurate data or no data — never confident-sounding fiction.
In free text, a hallucinated sentence often betrays itself with vague, hedged phrasing. In structured output, a hallucinated field like `"founded_year": 1987` looks authoritative and is nearly impossible to distinguish from a correct value. Three factors make this especially dangerous:
1. A hallucinated JSON value looks exactly like a real one. There is no hedging, no “approximately” — just a clean, confident data point that happens to be wrong.
2. Required fields force the LLM to produce a value even when it has no knowledge. The model invents data rather than leaving a gap in the structure.
3. Structured data feeds directly into databases, analytics, and automations. A wrong value propagates through pipelines without human review.
| Pattern | Example | Cause |
|---|---|---|
| Confident fabrication | `"ceo": "John Smith"` | LLM fills a required field with a plausible name |
| Temporal confusion | `"revenue": "$2.3B"` | Training-data cutoff or conflation of reporting periods |
| Entity conflation | Attributes from Company A on Company B | Similar names in overlapping training data |
| Plausible defaults | `"employees": 500` | LLM picks a “reasonable” number over admitting ignorance |
| Invented relationships | `"subsidiary_of": "Alphabet"` | LLM infers a relationship that does not exist |
Entity Enricher does not rely on a single technique. It stacks 8 independent defense layers, each targeting a different failure mode. If one layer misses a hallucination, the next catches it.
Layer 1: Entity Type Verification
Before enrichment begins, a fast LLM classifies whether the entity actually matches the schema type. This blocks whole-entity hallucination at the source.
Example: "Titan" against a "Planet" schema is flagged as a moon — enrichment models receive this context and use null for planet-specific fields.
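The type gate can be sketched as follows. This is a minimal illustration, not the actual implementation: the classifier here is a stub standing in for the fast LLM call, and all names (`TypeCheck`, `enrichment_context`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TypeCheck:
    matches: bool       # does the entity fit the schema type?
    actual_type: str    # what the classifier thinks the entity really is

def classify_entity_type(name: str, schema_type: str) -> TypeCheck:
    # Stub standing in for a fast LLM classification call.
    known = {"Titan": "Moon", "Mars": "Planet"}
    actual = known.get(name, schema_type)
    return TypeCheck(matches=(actual == schema_type), actual_type=actual)

def enrichment_context(name: str, schema_type: str) -> str:
    """Build the context note passed to the enrichment models."""
    check = classify_entity_type(name, schema_type)
    if check.matches:
        return f"{name} is a {schema_type}."
    return (f"WARNING: {name} appears to be a {check.actual_type}, not a "
            f"{schema_type}. Use null for {schema_type}-specific fields.")
```

The key design point is that a mismatch does not abort enrichment; it changes the instructions the downstream models receive.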
Layer 2: Explicit Permission to Return Null
All enrichment strategies instruct the LLM: "Be accurate and conservative — prefer null over guessing." Nullable schema fields give the model explicit permission to say "I don’t know."
This directly addresses schema pressure — the #1 cause of structured hallucination.
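A nullable schema might look like the sketch below. The field names and the `CompanyFacts` class are illustrative, not the product's real schema; the point is that every enrichable field defaults to `None`, so the model is never structurally forced to invent a value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompanyFacts:
    name: str                           # key field, always present in input
    founded_year: Optional[int] = None  # nullable: the model may decline
    ceo: Optional[str] = None
    employees: Optional[int] = None

# Conservative instruction included in every enrichment prompt.
CONSERVATIVE_INSTRUCTION = (
    "Be accurate and conservative - prefer null over guessing. "
    "Leave any field you are not certain about as null."
)
```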
Layer 3: Domain-Scoped Extraction
Schema properties are grouped by expertise domain. Each LLM call sees only the fields within its domain, with instructions to focus exclusively on that area.
Narrower scope means less opportunity to hallucinate. A financial expert never guesses about regulatory data.
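Domain scoping can be sketched as a simple field partition. The domain names, field lists, and helper functions here are hypothetical examples, assuming one enrichment call per domain.

```python
# Illustrative partition of schema fields by expertise domain.
DOMAIN_GROUPS = {
    "financial": ["revenue", "funding_total", "valuation"],
    "corporate": ["ceo", "founded_year", "headquarters"],
    "workforce": ["employees", "open_positions"],
}

def fields_for_call(domain: str) -> list[str]:
    """Each enrichment call sees only its own domain's fields."""
    return DOMAIN_GROUPS[domain]

def domain_prompt(domain: str) -> str:
    fields = ", ".join(fields_for_call(domain))
    return (f"You are a {domain} expert. Fill ONLY these fields: {fields}. "
            f"Ignore everything outside your domain.")
```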
Layer 4: Key-Property Anchoring
Key properties (marked `is_key: true`) are highlighted in prompts to anchor the LLM on identifying information before it fills other fields.
This grounds the model on known facts, reducing drift toward fabricated details.
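A minimal sketch of how key-property anchoring might shape the prompt — the `is_key` flag comes from the text above, while the function name and prompt wording are assumptions:

```python
def build_anchored_prompt(entity: dict, schema: dict) -> str:
    """Surface key fields first so the model is grounded on the right entity."""
    keys = [f for f, spec in schema.items() if spec.get("is_key")]
    anchor = "; ".join(f"{k}={entity[k]}" for k in keys if k in entity)
    return (f"Known identifying facts: {anchor}. "
            "Fill the remaining fields only for THIS exact entity.")
```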
Layer 5: Validation with Automatic Retry
Eight validation rules check LLM output for type mismatches, invalid references, and structural errors. Failed validation triggers ModelRetry — errors are sent back to the LLM for correction.
Up to 6 automatic correction attempts within a single agent run. The LLM fixes its own mistakes.
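The validate-and-retry loop can be sketched like this. The model call and validator are stubbed callables; in the real pipeline the error feedback travels via ModelRetry inside the agent run, whereas this standalone sketch just loops.

```python
MAX_RETRIES = 6  # up to 6 correction attempts, as described above

def run_with_retries(call_model, validate, max_retries=MAX_RETRIES):
    """call_model(errors) returns an output dict; validate(output)
    returns a list of error strings (empty means the output passed)."""
    errors: list[str] = []
    for _ in range(max_retries + 1):
        output = call_model(errors)   # errors from the last attempt are fed back
        errors = validate(output)
        if not errors:
            return output
    raise ValueError(f"still failing after {max_retries} retries: {errors}")
```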
Layer 6: Protected Field Restoration
Fields marked `preserve: true` (IDs, SKUs, timestamps) are restored to their original input values after enrichment. The LLM cannot overwrite ground truth data.
Protected fields: entity IDs, system codes (EAN, SKU), import identifiers, creation timestamps.
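Restoration is conceptually a copy-back pass after enrichment. A minimal sketch, assuming entities are plain dicts and using an illustrative set of protected field names:

```python
# Illustrative protected-field set; the real list comes from the
# schema's preserve: true markers.
PRESERVED = {"id", "sku", "ean", "created_at"}

def restore_preserved(original: dict, enriched: dict,
                      preserved: set[str] = PRESERVED) -> dict:
    """Copy protected fields back from the input after enrichment."""
    result = dict(enriched)
    for field in preserved:
        if field in original:
            result[field] = original[field]  # the input value always wins
    return result
```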
Layer 7: Cross-Model Consensus
The same entity is run through two or more independent models and their outputs are compared field by field. Disagreements are flagged as potential hallucinations.
If Claude says revenue is $2.3B and GPT-4 says $1.8B — that conflict is detected and surfaced.
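Field-by-field conflict detection reduces to a dict comparison. A minimal sketch, assuming each model's output is a flat dict:

```python
def find_conflicts(output_a: dict, output_b: dict) -> dict:
    """Return {field: (a_value, b_value)} for every field where the
    two models disagree; agreeing fields are left out."""
    conflicts = {}
    for field in output_a.keys() & output_b.keys():
        if output_a[field] != output_b[field]:
            conflicts[field] = (output_a[field], output_b[field])
    return conflicts
```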
Layer 8: Conflict Arbitration
Detected conflicts are resolved by rule-based voting (majority, median, union) or by a dedicated LLM arbitrator that evaluates accuracy, completeness, and consistency.
Each arbitration decision includes reasoning and confidence level — full transparency into how conflicts were resolved.
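The rule-based strategies named above are straightforward; here is a sketch of majority voting for categorical values and median for numeric ones. Ties falling through to the arbitrator is an assumption about the escalation path, not confirmed behavior.

```python
from collections import Counter
from statistics import median

def resolve_majority(values):
    """Pick the most common value; None on a tie (e.g. escalate
    to the LLM arbitrator instead)."""
    (top, n), *rest = Counter(values).most_common()
    if rest and rest[0][1] == n:
        return None  # no clear winner
    return top

def resolve_median(values):
    """Median is robust to a single outlier model."""
    return median(values)
```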
Core Principle
Missing data is always better than wrong data. Every layer reinforces this principle — the system is designed to return null rather than a plausible-sounding fabrication.