Core Concepts - Entity Enricher Documentation

Core Concepts

Entity Enricher turns two kinds of knowledge into structured, validated data: what Large Language Models already know, and what sits unread in your own archives — PDF documents, images, audio recordings, office files. Every extracted object receives a stable semantic identity, so enrichments accumulate into a coherent information system instead of a pile of one-off results.

The Core Idea

Think of LLMs as distilled human knowledge — billions of documents, databases, and web pages compressed into queryable neural networks. Entity Enricher provides the interface to extract this knowledge in a structured, reliable format that fits your data model. And because modern models can also read PDFs, see images, and hear audio, the same interface extracts structure from your own content: the contracts, reports, scans, and recordings your company has accumulated for years.

Your Data & Archives: partial records, raw identifiers, PDFs & scans, images & audio
→ Schema + LLM: “What do I want to know?”
→ Your Information System: structured profiles, classifications, multilingual fields, stable semantic IDs

Two Sources of Knowledge

Every enrichment draws on one or both of these sources. They complement each other: the model supplies world knowledge and reasoning; your documents supply the facts that only exist inside your organization.

1. The Model’s Training Knowledge

Public facts about companies, drugs, places, products, regulations — anything the model learned during training. Give it an identifier (a name, a website) and a schema, and it fills in the rest: industry, founding year, headquarters, mechanisms of action. No document required.

2. Your Unstructured Archives

The knowledge that never made it into a database: contracts, invoices, inspection reports, scanned forms, product photos, recorded calls. Attach them to an enrichment and the model extracts your schema’s fields directly from their content — no manual OCR, transcription, or copy-paste.

See Document Attachments for supported formats and delivery modes.

Three Pillars

1. The Schema: Your Question to the Knowledge Base

A schema is not just a data structure: it is a formalized question you put to the collective knowledge of humanity, or to a specific document. When you define a schema with properties like companyName, industry, and headquarters, you are essentially asking: “Given a company identifier, tell me its name, what industry it operates in, and where it is headquartered.”

Schema Concept	Purpose
Properties	The specific facts you want to extract
Types	The format you expect (string, number, object, array)
Expertise Domains	Which specialist should answer (pharmaceutical, financial, geographic)
Search Keys	Identifiers that help locate the entity in the knowledge base
Semantic ID	A stable, organization-scoped identity so the same real-world object is recognized across enrichments and your other systems
Preserve	Fields to pass through unchanged from your input
Multilingual	Fields delivered in every language you operate in — a first-class feature, not a bolt-on translation step
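
As a minimal sketch, the concepts in the table can be written down as plain data. The key names used here (`properties`, `search_key`, `expertise`, `preserve`, `multilingual`) and the `schema_question` helper are illustrative assumptions, not Entity Enricher's actual schema format.

```python
# Hypothetical schema layout; key names are assumptions for illustration only.
company_schema = {
    "name": "Company",
    "properties": {
        "companyName": {"type": "string", "search_key": True, "preserve": True},
        "industry": {"type": "string", "expertise": "business_classification"},
        "headquarters": {"type": "string", "expertise": "geographic"},
        "description": {"type": "string", "multilingual": True},
    },
}


def schema_question(schema: dict) -> str:
    """Render the schema as the plain-language question it stands for."""
    props = schema["properties"]
    keys = [name for name, spec in props.items() if spec.get("search_key")]
    wanted = [name for name in props if name not in keys]
    return f"Given {', '.join(keys)}, tell me: {', '.join(wanted)}."


print(schema_question(company_schema))
# prints: Given companyName, tell me: industry, headquarters, description.
```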

2. The LLM: Queryable Knowledge, Multimodal Reader

Large Language Models represent a new kind of knowledge base. Unlike traditional databases that return exact matches on stored records, LLMs understand context, reason about incomplete data, and generalize from patterns. And they are no longer text-only: vision-capable models read images and scanned pages, PDF-capable models ingest whole documents, and audio-capable models listen to recordings.

Entity Enricher treats multiple LLMs as different knowledge perspectives. Each provider brings its own strengths — Claude excels at nuanced reasoning, GPT-4 has broad knowledge, Gemini offers multilingual depth, and local Ollama models keep your data private.

Running the same enrichment across multiple providers lets you compare answers for confidence, aggregate consensus from multiple experts, and balance cost versus quality. Learn more about this in Multi-Model Enrichment.
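
One simple way to picture "aggregate consensus" is a majority vote over the answers each provider gave for a field. The sketch below assumes that reading; the provider names and the `consensus` function are hypothetical, and the product's real fusion logic may work differently.

```python
from collections import Counter


def consensus(answers: dict[str, str]) -> tuple[str, float]:
    """Return the most common answer and the share of providers that gave it."""
    counts = Counter(value.strip().lower() for value in answers.values())
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical per-provider answers for one field.
industry_answers = {
    "provider_a": "Pharmaceutical",
    "provider_b": "pharmaceutical",
    "provider_c": "Healthcare",
}

value, agreement = consensus(industry_answers)
print(value, round(agreement, 2))
```

A low agreement share is a cheap signal that a field deserves a second look or a stronger model.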

3. The Enrichment: Structured Knowledge Extraction

Enrichment runs six steps in order: identify the entity using its search keys; retrieve relevant knowledge from the LLM and any attached documents; structure the response according to your schema; validate that the output matches the expected types; preserve your original data where specified; and finally resolve identity by assigning each object its stable semantic ID.

Input

{ "name": "Novartis", "website": "novartis.com" }

Extract keys → Query LLM → Validate → Resolve identity

Output

{ "name": "Novartis", "industry": "Pharmaceutical", "foundedYear": 1996, "semantic_id": "cpt_abc123" }
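
The flow above can be sketched end to end with a stubbed model. Everything here is an assumption made for illustration: `fake_llm` returns a canned answer instead of calling a provider, and the hash-based ID only stands in for real identity resolution, which matches by meaning rather than by spelling.

```python
import hashlib

SCHEMA_TYPES = {"name": str, "industry": str, "foundedYear": int}
SEARCH_KEYS = ("name", "website")
PRESERVE = ("name",)


def fake_llm(keys: dict) -> dict:
    """Stand-in for a real model call; returns a canned answer."""
    return {"name": "Novartis AG", "industry": "Pharmaceutical", "foundedYear": 1996}


def enrich(record: dict) -> dict:
    # 1. Extract the search keys present in the input.
    keys = {k: record[k] for k in SEARCH_KEYS if k in record}
    # 2. Query the (stubbed) model.
    answer = fake_llm(keys)
    # 3. Validate the answer against the schema types.
    for field, expected in SCHEMA_TYPES.items():
        if not isinstance(answer.get(field), expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    # 4. Restore preserved input values.
    for field in PRESERVE:
        if field in record:
            answer[field] = record[field]
    # 5. Resolve identity (toy version: hash of the lowercased name).
    digest = hashlib.sha1(answer["name"].lower().encode()).hexdigest()
    answer["semantic_id"] = f"cpt_{digest[:6]}"
    return answer


result = enrich({"name": "Novartis", "website": "novartis.com"})
print(result)
```

Note that the preserved `name` comes back as "Novartis", the input value, even though the stub answered "Novartis AG".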

From Enrichments to an Information System

Every enrichment is independent. Ask twice and the same real-world thing can come back described differently — “Acme Inc.” one day, “Acme Incorporated” the next; a drug side-effect as “Headache”, “Céphalée”, or “Cephalalgia” depending on language or model. To actually build on enriched data, you need a stable handle for the same entity.

A semantic ID is an organization-scoped identifier Entity Enricher assigns to an object from its key fields, matched by meaning, not exact spelling. The same entity resolves to the same ID across enrichments, models, languages, and time. It is assigned automatically after the model runs — never invented by the LLM — and can live on any object: the whole entity, a nested object, or each item in a list.

Enrichment run #1: “Acme Inc.” → semantic ID cpt_abc123
Run #2 (later, different model or language): “Acme Incorporated” → same semantic ID cpt_abc123

This is what turns a stream of enrichments into an information system you can grow and query:

Use	What it enables
Join key	A stable key to match enriched records against your warehouse, CRM, or master-data system
Deduplication	Collapse near-duplicates produced across batches, models, or years of documents into one identity
Reconciliation	Pass a known semantic ID back in and new facts attach to the entity you already track, instead of minting a new one
Knowledge graph	Objects referenced from multiple records converge on one node — relationships become queryable
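
To illustrate the deduplication and join-key rows, the sketch below collapses records that share a semantic ID into one entity. The records, the second ID `cpt_def456`, and the "first value seen wins" merge rule are assumptions chosen to keep the example short.

```python
from collections import defaultdict

# Hypothetical enriched records from different runs.
records = [
    {"semantic_id": "cpt_abc123", "name": "Acme Inc.", "country": "US"},
    {"semantic_id": "cpt_abc123", "name": "Acme Incorporated", "industry": "Manufacturing"},
    {"semantic_id": "cpt_def456", "name": "Globex", "country": "DE"},
]


def collapse(rows: list[dict]) -> dict[str, dict]:
    """Merge rows sharing a semantic ID; the first value seen for a field wins."""
    merged: dict[str, dict] = defaultdict(dict)
    for row in rows:
        for field, value in row.items():
            merged[row["semantic_id"]].setdefault(field, value)
    return dict(merged)


entities = collapse(records)
print(len(entities))  # prints: 2
```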

How resolution works (exact-match cache, embeddings, similarity thresholds) is covered in Semantic IDs.
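
A toy version of that resolution order, exact-match cache first and similarity second, might look like the sketch below. It is not the product's algorithm: `difflib` string similarity stands in for embedding similarity, the 0.6 threshold is arbitrary, and the `Resolver` class and counter-based IDs are hypothetical.

```python
import re
from difflib import SequenceMatcher


class Resolver:
    """Toy resolver: exact-match cache first, then fuzzy similarity."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self.cache: dict[str, str] = {}  # normalized key -> semantic ID
        self.minted = 0

    @staticmethod
    def normalize(text: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

    def resolve(self, key: str) -> str:
        norm = self.normalize(key)
        if norm in self.cache:  # 1. exact-match cache
            return self.cache[norm]
        match = next(  # 2. similarity against known keys
            (sid for known, sid in self.cache.items()
             if SequenceMatcher(None, norm, known).ratio() >= self.threshold),
            None,
        )
        if match is None:  # 3. nothing close enough: mint a new identity
            self.minted += 1
            match = f"cpt_{self.minted:06d}"
        self.cache[norm] = match
        return match


resolver = Resolver()
first = resolver.resolve("Acme Inc.")
second = resolver.resolve("Acme Incorporated")
other = resolver.resolve("Globex Corp")
print(first == second, first == other)  # prints: True False
```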

Mining Decades of Archives

Most companies sit on an archive that was never structured: shared drives of contracts and reports, scanned paper, email attachments, recorded meetings. That archive is a database — it just was never given rows and columns. Combining attachments (documents as a knowledge source), batch enrichment (parallel processing), and semantic IDs (deduplication across the whole corpus) turns it into one.

Archive files → Attach to enrichment → Schema as the extraction question → Validated structured records → Semantic identity & dedup → Your database

Batch at scale — entities are enriched in parallel with live per-entity progress, cost estimates up front, and selective retry for the few that fail
Guarded extraction — pre-flight classification and schema validation keep a mis-filed document from polluting your records with confident nonsense
Convergent identity — the same supplier appearing in a 2009 contract and a 2024 invoice resolves to the same semantic ID, so the archive collapses into clean master data
Out through the API — results export as validated JSON or flow straight into your systems via the REST API and connectors (n8n, Make, MCP)

See Batch Enrichment for the workflow in detail.
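
Parallel enrichment with selective retry can be sketched with the standard library alone. The `enrich_one` stub, which fails once for the entity named "flaky", and the `run_batch` helper are assumptions for illustration; they do not model progress reporting or cost estimates.

```python
from concurrent.futures import ThreadPoolExecutor

attempts: dict[str, int] = {}


def enrich_one(name: str) -> dict:
    """Stand-in for a real enrichment call; 'flaky' fails on its first attempt."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "flaky" and attempts[name] == 1:
        raise RuntimeError("transient failure")
    return {"name": name, "status": "enriched"}


def run_batch(names: list[str], workers: int = 4) -> tuple[dict, list[str]]:
    """Enrich names in parallel; return results and the names that failed."""
    def safe(name: str):
        try:
            return name, enrich_one(name), None
        except Exception as exc:
            return name, None, exc

    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, result, error in pool.map(safe, names):
            if error is None:
                results[name] = result
            else:
                failed.append(name)
    return results, failed


results, failed = run_batch(["acme", "globex", "flaky"])
retried, still_failed = run_batch(failed)  # selective retry: only the failures
results.update(retried)
print(sorted(results), still_failed)
```

Catching the exception inside each task keeps one bad entity from aborting the whole batch, which is what makes retrying only the failures possible.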

Beyond Text: Multimodal Sources

Structured knowledge does not only live in text. Entity Enricher accepts the formats your archive actually contains and routes each one to models capable of reading it.

Source	How it is read
PDF documents	Whole documents with layout, tables, and figures, read natively by PDF-capable models
Images	Photos, scans, diagrams, product shots, interpreted by vision models with no separate OCR step
Audio	Recorded calls, meetings, and voice notes, heard directly by audio-capable models
Office & text	Word, Excel, PowerPoint, HTML, CSV, Markdown, with text extracted server-side and inlined

Two delivery modes make this work. In binary mode, the original bytes go to the model so nothing is lost in conversion — a table’s layout, a photo’s detail, a speaker’s words. In inline-text mode, text is extracted once at upload and inlined into every prompt, which works with any model regardless of its capabilities.

Capability-aware routing means a file only reaches models that can actually process it — you are warned before an enrichment starts, not after it fails. Formats and modes are detailed in Document Attachments.
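
The routing idea can be sketched as a lookup from file kind to the models able to read it. The capability table, model names, and both functions are hypothetical, and the sketch simplifies by always sending office and text files inline and everything else as binary.

```python
# Hypothetical capability table; model names and flags are made up.
MODEL_CAPABILITIES = {
    "model-vision": {"image", "pdf"},
    "model-audio": {"audio"},
    "model-text-only": set(),
}

# Simplifying assumption: office and text formats are always inlined as text.
INLINE_TEXT_KINDS = {"office", "text"}


def eligible_models(file_kind: str) -> tuple[str, list[str]]:
    """Return the delivery mode and the models that can process the file."""
    if file_kind in INLINE_TEXT_KINDS:
        return "inline-text", sorted(MODEL_CAPABILITIES)  # works with any model
    models = sorted(m for m, caps in MODEL_CAPABILITIES.items() if file_kind in caps)
    return "binary", models


def preflight_warning(file_kind: str, chosen: list[str]) -> list[str]:
    """List chosen models that cannot read the file, before anything runs."""
    _, ok = eligible_models(file_kind)
    return [m for m in chosen if m not in ok]


print(eligible_models("audio"))
print(preflight_warning("pdf", ["model-vision", "model-text-only"]))
```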

Expertise Domains: Consulting the Right Specialist

Not all knowledge is equal. A question about drug mechanisms requires different expertise than a question about corporate structure. Expertise domains route schema properties to the right specialist within the LLM, activating the relevant knowledge patterns for each domain.

Domain	Covers
pharmaceutical	Drug names, mechanisms, indications, regulatory status
business_classification	Industry codes, company types, market segments
geographic	Locations, regions, country-specific information
financial	Revenue, market cap, funding rounds
temporal	Dates, periods, historical events
regulatory	Approvals, licenses, compliance status

When using the multi-expertise strategy, each domain gets its own focused LLM call with only the relevant schema properties, improving output quality significantly.
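
A minimal sketch of that grouping step: properties are bucketed by domain so each bucket can become one focused call. The property names and the `plan_calls` helper are hypothetical; only the domain names come from the list above.

```python
from collections import defaultdict

# Hypothetical property -> expertise domain mapping.
PROPERTY_DOMAINS = {
    "mechanismOfAction": "pharmaceutical",
    "indications": "pharmaceutical",
    "headquarters": "geographic",
    "revenue": "financial",
    "approvalStatus": "regulatory",
}


def plan_calls(property_domains: dict[str, str]) -> dict[str, list[str]]:
    """Group properties so each domain gets one call with only its own fields."""
    calls: dict[str, list[str]] = defaultdict(list)
    for prop, domain in property_domains.items():
        calls[domain].append(prop)
    return dict(calls)


calls = plan_calls(PROPERTY_DOMAINS)
for domain, props in calls.items():
    print(domain, props)
```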

Quality Controls

Validation and Self-Correction

LLMs can make mistakes. Entity Enricher implements multiple layers of quality control to catch and fix errors automatically:

Type Validation — Ensures output matches schema types (string, number, boolean, etc.)
Expertise Validation — Verifies all expertise domains are defined and contain properties
Self-Correction — When validation fails, errors are sent back to the LLM for automatic correction (up to 5 retries)
Preserve Logic — Original values for preserved fields are restored after enrichment, ensuring data integrity
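
The layers above can be sketched as a loop: validate, feed errors back, and stop after the stated limit of 5 retries. The stub `fake_llm`, which returns a string year until it is shown validation errors, and the function names are assumptions for illustration.

```python
MAX_RETRIES = 5  # the retry limit stated above
SCHEMA_TYPES = {"name": str, "foundedYear": int}


def validate(output: dict) -> list[str]:
    """Return one error message per field whose type does not match the schema."""
    return [
        f"{field}: expected {expected.__name__}, got {type(output.get(field)).__name__}"
        for field, expected in SCHEMA_TYPES.items()
        if not isinstance(output.get(field), expected)
    ]


def fake_llm(errors: list[str]) -> dict:
    """Stand-in model: answers badly first, then fixes itself when shown errors."""
    year = 1996 if errors else "1996"
    return {"name": "Novartis AG", "foundedYear": year}


def enrich_with_correction(original: dict, preserve: tuple[str, ...]) -> dict:
    errors: list[str] = []
    for _ in range(1 + MAX_RETRIES):  # first attempt plus up to 5 retries
        output = fake_llm(errors)
        errors = validate(output)
        if not errors:
            break
    else:
        raise ValueError(f"still invalid after {MAX_RETRIES} retries: {errors}")
    for field in preserve:  # restore preserved input values
        if field in original:
            output[field] = original[field]
    return output


fixed = enrich_with_correction({"name": "Novartis"}, preserve=("name",))
print(fixed)  # prints: {'name': 'Novartis', 'foundedYear': 1996}
```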

Search Keys: Anchoring Identity During Enrichment

Search keys prevent the LLM from hallucinating about the wrong entity. Key fields serve two roles:

Search keys (name, website) — Lookup identifiers that help the LLM find the right entity
Merge keys (product_name in arrays) — Deduplication keys for matching array items when merging results from multiple models

The enrichment prompt emphasizes: “You are enriching this specific entity identified by these search keys.”

Search keys and semantic IDs are two sides of identity: search keys help the LLM find the right entity during enrichment; semantic IDs give it a persistent identity your systems rely on after enrichment.
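
The merge-key role can be illustrated by combining array items from two models and matching them on `product_name`. The sample items, the case-insensitive match, and the "first value seen wins" rule are assumptions; the actual merge policy may differ.

```python
def merge_items(lists: list[list[dict]], merge_key: str) -> list[dict]:
    """Merge array items from several models, matching items on merge_key."""
    merged: dict[str, dict] = {}
    for items in lists:
        for item in items:
            key = item[merge_key].strip().lower()
            target = merged.setdefault(key, {})
            for field, value in item.items():
                target.setdefault(field, value)  # keep the first value seen
    return list(merged.values())


# Hypothetical product arrays returned by two models.
model_a = [{"product_name": "Widget", "category": "Tools"}]
model_b = [
    {"product_name": "widget", "launch_year": 2020},
    {"product_name": "Gadget", "category": "Toys"},
]

products = merge_items([model_a, model_b], merge_key="product_name")
print(products)
```

Without the merge key, "Widget" and "widget" would appear as two separate items in the fused array.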

Pre-flight Classification

Before enrichment begins, an optional pre-flight classification step can verify that the entity actually matches the schema type. This prevents hallucination when entities do not match — for example, enriching “Titan” against a “Planet” schema when Titan is actually a moon.
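
In sketch form, pre-flight classification is a gate in front of the enrichment call. The lookup table below stands in for a model-based classifier, and the function names are hypothetical.

```python
# Hypothetical lookup standing in for a model-based classifier.
KNOWN_TYPES = {"titan": "Moon", "saturn": "Planet"}


def classify(entity_name: str) -> str:
    """Return the entity's type, or 'Unknown' when it is not recognized."""
    return KNOWN_TYPES.get(entity_name.lower(), "Unknown")


def preflight(entity_name: str, schema_type: str) -> bool:
    """Allow enrichment only when the entity's type matches the schema's."""
    return classify(entity_name) == schema_type


print(preflight("Titan", "Planet"))   # prints: False
print(preflight("Saturn", "Planet"))  # prints: True
```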

Cost Awareness

LLM calls have costs. Entity Enricher tracks token usage, cost per provider, cost per enrichment, and organization-scoped spending. This enables budget monitoring, provider comparison (cost vs. quality), and optimization decisions like using cheaper models for simple fields — which matters most when processing an archive of thousands of documents.
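
Cost tracking reduces to token counts multiplied by per-token prices, summed per provider. The prices, provider names, and the `Usage` record below are made up for the example; real prices vary by provider and model.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens.
PRICES = {
    "provider_a": {"input": 3.00, "output": 15.00},
    "provider_b": {"input": 0.50, "output": 1.50},
}


@dataclass
class Usage:
    provider: str
    input_tokens: int
    output_tokens: int


def cost(usage: Usage) -> float:
    """Cost of one call in USD."""
    price = PRICES[usage.provider]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000


def spend_by_provider(usages: list[Usage]) -> dict[str, float]:
    """Total spend per provider across a log of calls."""
    totals: dict[str, float] = {}
    for usage in usages:
        totals[usage.provider] = totals.get(usage.provider, 0.0) + cost(usage)
    return totals


log = [
    Usage("provider_a", 2_000, 500),
    Usage("provider_b", 2_000, 500),
    Usage("provider_a", 1_000, 200),
]
totals = spend_by_provider(log)
print(totals)
```

With the same token counts, the two hypothetical providers differ in cost by roughly a factor of eight, which is the kind of gap that makes per-field model choice worthwhile at archive scale.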

Summary

Component	Conceptual Role
Schema	The question you are asking
LLM Providers	Different knowledge perspectives
Attachments	Your archives as a knowledge source (PDF, image, audio, office)
Search Keys	Entity identity anchors during enrichment
Semantic IDs	Stable identity after enrichment — the backbone of your information system
Expertise Domains	Specialist routing
Strategies	How to orchestrate LLM calls
Batch Processing	Parallel enrichment at archive scale
Multilingual	The same fact in every language you operate in
Validation	Quality assurance
Preserve	Data integrity protection

Next Steps

Enrichment Flow

Step-by-step walkthrough of the enrichment pipeline

Semantic IDs

Stable entity identity for deduplication and interoperability

Document Attachments

PDFs, images, audio, and office files as enrichment sources

Batch Enrichment

Parallel processing for lists and archives

Enrichment Strategies

Compare single-pass vs multi-expertise approaches

Multi-Model Fusion

Conflict detection and resolution across models