Semantic IDs - Entity Enricher Documentation

Semantic IDs

Enrich the same kind of entity again and again and you keep re-discovering the same real-world things — the same company, the same drug side-effect, the same person — described with slightly different words each time. A semantic ID is a stable, organization-scoped identifier Entity Enricher assigns to an object from its key fields, so those near-duplicates collapse to one identity you can group, deduplicate, and join on.

The problem: same thing, different words

An object’s identity is built from its key fields — and there can be one or several. Two examples:

One key

A side-effect keyed by `name`

It shows up as Headache, Céphalée, and Cephalalgia across runs and languages. One key field, three spellings, one real concept.

Two keys

A company keyed by `name` + `country`

Acme Inc. · United States and Acme Incorporated · United States are the same company — while Acme Inc. · Germany is a different one. The second key disambiguates; that’s why an object can carry more than one.

Plain string matching fails on all of these; a human knows which are the same. Semantic IDs encode that judgement automatically.

What a semantic ID is

• A single string property on an object (named id by default), holding an opaque, stable identifier.
• Stable & organization-scoped: the same real-world thing resolves to the same ID across enrichments, batches, and time, within your organization only.
• Assigned automatically, never by the LLM: it’s computed in a post-enrichment pass, so the model can’t hallucinate it. It’s a pass-through (preserve) field: always a string, never a key, never multilingual, at most one per object.
• Allowed on any object: the whole entity (root), a 1-1 nested object (e.g. a manufacturer), or each item in an array (e.g. each side_effect).
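To make the placement rules concrete, here is a small Python sketch that walks an enrichment result and lists every semantic ID it finds, whether on the root, on a 1-1 nested object, or on array items. The sample payload, its field names (`manufacturer`, `side_effects`) and the ID values are invented for illustration; only the default property name `id` comes from the description above.

```python
def collect_ids(node, path="$", id_field="id"):
    """Return (path, id) pairs for every object that carries a string ID."""
    found = []
    if isinstance(node, dict):
        value = node.get(id_field)
        if isinstance(value, str):
            found.append((path, value))
        for name, child in node.items():
            found.extend(collect_ids(child, f"{path}.{name}", id_field))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            found.extend(collect_ids(child, f"{path}[{index}]", id_field))
    return found


# Hypothetical result: names and IDs are placeholders, not real platform output.
sample = {
    "id": "concept-root",
    "name": "Example Drug",
    "manufacturer": {"id": "concept-mfr", "name": "Example Pharma"},
    "side_effects": [
        {"id": "concept-se-1", "name": "Headache"},
        {"id": "concept-se-2", "name": "Nausea"},
    ],
}
ids = collect_ids(sample)
```

Each array item reports its own ID, which matches the rule that at most one semantic ID exists per object rather than per entity.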

How it works

After the model returns its result, Entity Enricher resolves each semantic ID in four steps — cheapest first:

Compose the identity text

Join all of the object’s key fields — plus the keys of any 1-1 nested objects it contains — into a single string, in your primary language. Items inside arrays are not pulled in: each array item owns its own identity. The text is normalized (lowercased, parentheticals dropped, whitespace collapsed) to shrink trivial differences.
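The composition step can be approximated in a few lines of Python. This is a sketch under assumptions, not the platform's actual code: the function name, the `nested_keys` parameter, and the single-space separator are hypothetical, and the real normalization may differ in detail.

```python
import re
from typing import Dict, List, Optional


def compose_identity_text(
    obj: dict,
    key_fields: List[str],
    nested_keys: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Join an object's key fields (plus keys of 1-1 nested objects), then normalize."""
    parts = [str(obj[field]) for field in key_fields if obj.get(field)]
    for nested_name, fields in (nested_keys or {}).items():
        nested = obj.get(nested_name)
        # Only 1-1 nested objects contribute; arrays are skipped because
        # each array item owns its own identity.
        if isinstance(nested, dict):
            parts += [str(nested[field]) for field in fields if nested.get(field)]
    text = " ".join(parts).lower()            # assumed separator: one space
    text = re.sub(r"\([^)]*\)", " ", text)    # drop parentheticals
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


company = {"name": "Acme  Inc. (formerly Foo)", "country": "United States"}
identity = compose_identity_text(company, ["name", "country"])
```

Here `identity` is `"acme inc. united states"`: the parenthetical and the double space disappear, so trivially different spellings already collide before any embedding is involved.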

Look for an exact match

If that exact normalized text has been seen before in your organization, its existing ID is reused immediately — no model call, no cost.
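Conceptually, the exact-match step behaves like a dictionary lookup scoped to the organization. The class below is a hypothetical in-memory stand-in (the names `ExactMatchCache`, `lookup`, and `remember` are not part of Entity Enricher) that shows why a repeat costs nothing and why another organization never sees the ID.

```python
class ExactMatchCache:
    """Maps (organization, normalized identity text) to a concept ID."""

    def __init__(self):
        self._ids = {}

    def lookup(self, org_id, identity_text):
        # None means "not seen yet": the caller falls through to embedding.
        return self._ids.get((org_id, identity_text))

    def remember(self, org_id, identity_text, concept_id):
        self._ids[(org_id, identity_text)] = concept_id


cache = ExactMatchCache()
cache.remember("org-a", "acme inc. united states", "concept-123")  # placeholder ID
hit = cache.lookup("org-a", "acme inc. united states")   # reused, no model call
miss = cache.lookup("org-b", "acme inc. united states")  # different organization
```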

Embed & compare

Otherwise the text is embedded and compared, by meaning, against existing concepts of the same type using vector similarity, so that “Acme Inc.” and “Acme Incorporated” land next to each other.

Reuse or mint

If the closest match scores above the similarity threshold (default 0.92, tunable per property), that concept’s ID is reused. Otherwise a brand-new ID is minted and stored for next time.

Threshold trade-off: a higher threshold is stricter (fewer accidental merges); a lower one is looser (more aggressive deduplication). Tune it per property when the default 0.92 over- or under-merges.
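The reuse-or-mint decision and the effect of the threshold can be sketched as follows. Several things here are assumptions: cosine similarity stands in for the unspecified "vector similarity", the two-dimensional vectors are toy values rather than real embeddings, and the UUID-based ID format is made up.

```python
import math
import uuid


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reuse_or_mint(vector, concepts, threshold=0.92):
    """Return (concept_id, reused). `concepts` maps ID -> stored vector
    for one organization and one concept type."""
    best_id, best_score = None, -1.0
    for concept_id, concept_vector in concepts.items():
        score = cosine_similarity(vector, concept_vector)
        if score > best_score:
            best_id, best_score = concept_id, score
    if best_id is not None and best_score > threshold:
        return best_id, True           # close enough: reuse the existing ID
    new_id = uuid.uuid4().hex          # hypothetical ID format
    concepts[new_id] = list(vector)    # stored for next time
    return new_id, False


concepts = {"concept-acme": [1.0, 0.0]}
near = reuse_or_mint([0.99, 0.05], concepts)  # similarity ~0.999: reused
far = reuse_or_mint([0.9, 0.5], concepts)     # similarity ~0.874: minted
```

With the default 0.92 the second vector gets a new ID; calling the same function with `threshold=0.85` against the original concept would merge it instead, which is the looser, more aggressive behaviour described above.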

Input IDs vs. generated IDs

Whether an ID is generated depends on whether one is already present in the input for that object. This is what lets you round-trip: enrich once to obtain IDs, then pass a known ID back on later runs to attach new facts to the same identity — cheaper and unambiguous.

ID already in the input → kept (lookup)

If the object you send already carries a semantic ID, it’s treated as a lookup: the ID is kept verbatim, the record is linked to that existing concept, and there is no embedding — no cost, no match-or-mint. You’re telling the platform “this object is already identified in our database.”

No ID in the input → generated

If the object has no semantic ID, the platform generates one with the four steps above. That ID becomes the object’s stable identifier in your organization’s database from then on.

A present-but-unrecognizable value (not a real concept ID) is ignored, and an ID is generated instead.
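The branching above can be summarized in a short sketch. The function and parameter names are hypothetical, and "unrecognizable" is modelled here simply as "not in the set of known concept IDs", which is an assumption about how the platform validates input values.

```python
def resolve_id(obj, known_concept_ids, generate, id_field="id"):
    """Return (concept_id, mode), where mode is "lookup" or "generated"."""
    supplied = obj.get(id_field)
    if isinstance(supplied, str) and supplied in known_concept_ids:
        # ID already in the input: kept verbatim, no embedding, no cost.
        return supplied, "lookup"
    # Missing or unrecognizable value: run the normal generation steps.
    return generate(obj), "generated"


known = {"concept-123"}                 # placeholder IDs
mint = lambda obj: "concept-new"        # stand-in for the four-step resolution

kept = resolve_id({"id": "concept-123", "name": "Acme Inc."}, known, mint)
fresh = resolve_id({"name": "Acme Inc."}, known, mint)
ignored = resolve_id({"id": "not-a-concept", "name": "Acme Inc."}, known, mint)
```

In a round trip, the first enrichment would take the "generated" path and every later run that passes the stored ID back would take the "lookup" path.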

How to enable it

Pick an embedding model (once per organization)

An owner chooses an embedding-capable model in Model Management as the organization’s default embedding model. It’s near-immutable: once concepts exist it can only be cleared, not switched (stored vectors aren’t comparable across models). Without it, semantic IDs are simply skipped.

Add semantic IDs to the schema

Two ways, both in the Schema Editor:

• Automatically at generation: tick “Generate semantic IDs for types”; every object with a key (its own, or one on a 1-1 nested object) gets one, including the root entity.
• Manually: use the “+ Add semantic ID” control on any object or the entity footer.

Resolution costs a small amount of embedding usage per enrichment (metered like any model call). The exact-match cache makes repeats free, and input-provided IDs cost nothing.

Where the IDs show up & what to do with them

Resolved IDs appear in the enrichment output JSON (the id field on each object) and in the record detail’s semantic concepts. Use them to:

• Deduplicate rows across batches and over time.
• Build a stable join key for your data warehouse or CRM.
• Reconcile the same entity seen on different days or in different languages.
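As an illustration of the first use, the sketch below deduplicates rows from two batches by their semantic ID, keeping the first row seen for each identity. The rows, the ID values, and the `dedupe_by_id` helper are hypothetical; real output will carry whatever opaque IDs the platform assigned.

```python
def dedupe_by_id(rows, id_field="id"):
    """Keep the first row for each semantic ID, preserving input order."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[id_field], row)
    return list(first_seen.values())


# Two batches describing overlapping side-effects (placeholder IDs).
monday = [{"id": "concept-se-1", "name": "Headache"}]
tuesday = [
    {"id": "concept-se-1", "name": "Céphalée"},
    {"id": "concept-se-2", "name": "Nausea"},
]
unique_rows = dedupe_by_id(monday + tuesday)
```

The same `id` value can serve as the join key in a warehouse or CRM table, since it stays the same regardless of which spelling or language a given run produced.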

Complements multi-model fusion

Fusion reconciles disagreements across models within a single run; semantic IDs reconcile the same entity across runs and time. The two work together.

Core Concepts

Search & merge keys, which semantic IDs build on

Multilingual Enrichment

Collapse cross-language spellings to one identity

Multi-Model Fusion

Reconcile across models within one run

Schema Editor

Add semantic IDs to any object