Pre-flight Classification - Entity Enricher Documentation

Pre-flight Classification

Pre-flight classification checks whether an entity matches the expected schema type before enrichment begins. This optional step reduces hallucinated output and wasted tokens when an entity does not match your schema.

Why Classify Before Enriching?

LLMs are eager to help. When asked to enrich an entity against a schema, they will produce structured output even if the entity does not match the schema type at all. This leads to hallucinated data that looks plausible but is entirely wrong.

The Hallucination Problem
Without Classification

Schema: “Planet” — Entity: “Titan”

The LLM treats Titan as a planet and invents data: orbital period, atmosphere composition, number of moons — all plausible-looking but wrong. Titan is actually a moon of Saturn.

With Classification

Classification detects: “mismatch — Titan is a moon, not a planet”

The enrichment models receive this context, set irrelevant fields to null, and only fill in properties that genuinely apply to the entity.

How It Works

Classification runs as a single, fast LLM call before any enrichment models begin. It uses a cheap, quick model (such as Claude Haiku or GPT-4o Mini) to minimize cost.

1. Send schema type and entity data
The classification model receives the schema name, description, and entity data (truncated to 3,000 characters to keep costs low).

2. Receive structured classification
The model returns a structured response with a status (match, mismatch, unknown, or ambiguous), a description of what the entity actually is, confidence level, and reasoning.

3. Inject context into enrichment
The classification result is prepended to every enrichment prompt as a “Pre-flight Classification” section. This gives enrichment models critical context about the entity type.
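The three steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names Classification, build_classification_input, and inject_classification, the dictionary keys, and the exact layout of the prepended section are hypothetical and are not taken from the product. Only the 3,000-character truncation, the four statuses, and the “Pre-flight Classification” section title come from the text.

```python
from dataclasses import dataclass
from typing import Literal, Optional

MAX_ENTITY_CHARS = 3000  # truncation limit stated in the documentation

Status = Literal["match", "mismatch", "unknown", "ambiguous"]


@dataclass
class Classification:
    """Hypothetical shape of the structured classification response."""
    status: Status
    entity_description: str  # what the entity actually is
    confidence: str          # assumed to be a label such as "high"
    reasoning: str


def build_classification_input(schema_name: str, schema_description: str,
                               entity_data: str) -> dict:
    """Step 1: assemble the minimal payload, truncating the entity data."""
    return {
        "schema_name": schema_name,
        "schema_description": schema_description,
        "entity_data": entity_data[:MAX_ENTITY_CHARS],
    }


def inject_classification(enrichment_prompt: str,
                          result: Optional[Classification]) -> str:
    """Step 3: prepend a "Pre-flight Classification" section to the prompt.

    If no classification is available, the prompt is returned unchanged.
    """
    if result is None:
        return enrichment_prompt
    section = (
        "Pre-flight Classification\n"
        f"Status: {result.status}\n"
        f"Entity: {result.entity_description}\n"
        f"Confidence: {result.confidence}\n"
        f"Reasoning: {result.reasoning}\n"
    )
    return section + "\n" + enrichment_prompt
```

Step 2, the model call itself, is left out because it depends on the provider SDK in use.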

Four Classification Statuses

Match

The entity matches the schema type. Enrichment proceeds with high confidence.

Prompt Effect
Confirms the entity type and provides additional context to the enrichment models.
Example
Schema "Pharmaceutical Company", Entity "Sanofi" — confirmed as a pharmaceutical company.

Mismatch

The entity is a different type than the schema expects. The classification explains what the entity actually is.

Prompt Effect
Warns enrichment models that the entity does not match. Instructs them to use null for irrelevant fields.
Example
Schema "Planet", Entity "Titan" — identified as a moon of Saturn, not a planet.

Unknown

The entity cannot be identified with certainty. The LLM does not have enough information to classify it.

Prompt Effect
Tells enrichment models to use null when uncertain rather than guessing.
Example
Schema "Pharmaceutical Company", Entity "XYZ Corp" — not enough information to determine the entity type.

Ambiguous

Multiple valid interpretations exist. The classification lists the alternatives.

Prompt Effect
Lists the possible interpretations and asks enrichment models to pick the most likely one.
Example
Schema "Company", Entity "Mercury" — could be the planet, the element, or Mercury Insurance.
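To make the four prompt effects concrete, the sketch below maps each status to the kind of instruction an enrichment prompt might receive. The function name prompt_effect and the exact wording of each instruction are assumptions made for illustration. The documentation describes only the intent of each effect, not the literal prompt text.

```python
from typing import Sequence


def prompt_effect(status: str, description: str = "",
                  alternatives: Sequence[str] = ()) -> str:
    """Return an illustrative instruction for the given classification status."""
    if status == "match":
        # Confirm the entity type and pass along extra context.
        return f"The entity is confirmed as: {description}."
    if status == "mismatch":
        # Warn about the mismatch and ask for null on irrelevant fields.
        return (
            "Warning: the entity does not match the schema. "
            f"It appears to be: {description}. "
            "Use null for fields that do not apply."
        )
    if status == "unknown":
        # Prefer null over guessing.
        return ("The entity could not be identified. "
                "Use null when uncertain rather than guessing.")
    if status == "ambiguous":
        # List the candidate interpretations.
        options = "; ".join(alternatives)
        return f"Possible interpretations: {options}. Pick the most likely one."
    raise ValueError(f"unexpected status: {status!r}")
```

For the “Mercury” example, calling prompt_effect("ambiguous", alternatives=["the planet", "the element", "Mercury Insurance"]) produces a single line listing all three interpretations.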

Key Properties

Non-blocking

Classification is purely advisory. If the classification call fails for any reason (model error, timeout, rate limit), enrichment proceeds normally without classification context. This ensures that the optional classification step never prevents enrichment from completing.
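One way to get this non-blocking behavior is to wrap the classification call so that any failure results in "no classification" rather than an error. The sketch below assumes a hypothetical classify callable supplied by the caller. It shows the pattern only and is not the product's actual code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def classify_or_none(classify: Callable[[], dict]) -> Optional[dict]:
    """Run the classification call; return None on any failure.

    Returning None lets the caller continue with enrichment without
    classification context instead of aborting the job.
    """
    try:
        return classify()
    except Exception as exc:  # model error, timeout, rate limit, ...
        logger.warning("Classification failed, continuing without it: %s", exc)
        return None
```

Catching the broad Exception class is deliberate here: the step is advisory, so no failure in it should propagate to the enrichment job.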

Cost-Effective

Classification is designed to run on fast, inexpensive models. It sends a minimal payload (schema name, description, and truncated entity data) and expects a small structured response. It typically costs a small fraction of the enrichment run itself, which is usually a worthwhile trade for the accuracy gained.

Real-Time Feedback

The UI shows classification progress in real time via Server-Sent Events. A classification_started event fires when the check begins, followed by a classification_completed event carrying the status, confidence, and entity description. The result appears as a banner above the model results.
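A client that wants to consume these events outside the UI could parse the stream along the following lines. This is a minimal sketch with two assumptions: that the event name is carried in the standard SSE event: field, and that the data: field holds a JSON object. The payload keys in the usage example (status, confidence) are likewise hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from the lines of an SSE stream."""
    event_name = "message"  # default event type in the SSE format
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are ignored.


stream = [
    "event: classification_started\n",
    "data: {}\n",
    "\n",
    "event: classification_completed\n",
    'data: {"status": "mismatch", "confidence": "high"}\n',
    "\n",
]
for name, payload in iter_sse_events(stream):
    print(name, payload)
```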

Cancellable

If you cancel the enrichment during the classification phase, the job stops immediately without starting any enrichment models. No unnecessary tokens are spent.
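The ordering that makes this possible can be illustrated with a cancellation flag that is checked after classification and before each enrichment model. The run_job function and its parameters are hypothetical. This simplified sketch only checks the flag once the classification call returns, whereas a real implementation may interrupt the in-flight call as well.

```python
import threading
from typing import Callable, List, Sequence


def run_job(classify: Callable[[], dict],
            enrichers: Sequence[Callable[[dict], str]],
            cancelled: threading.Event) -> List[str]:
    """Run classification, then enrichment, honoring a cancellation flag."""
    results: List[str] = []
    classification = classify()
    if cancelled.is_set():
        # Cancelled during classification: no enrichment model is started.
        return results
    for enrich in enrichers:
        if cancelled.is_set():
            break
        results.append(enrich(classification))
    return results
```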

When to Enable Classification

Recommended
  • Schemas with a narrow entity type (e.g., “Pharmaceutical Company”)
  • Input data that may contain mixed entity types
  • Batch enrichment with entities from diverse sources
  • When using expensive enrichment models and you want to avoid waste
Not Necessary
  • Generic schemas that accept any entity (e.g., “Organization”)
  • Curated input data where you control the entity type
  • Quick iterations where speed matters more than accuracy
  • Schemas without a clear entity type definition

How to Enable

In the Schema Editor or Batch Enrichment sidebar, look for the “Classification” dropdown. Select a fast, inexpensive model (Claude Haiku, GPT-4o Mini, or similar). The classification will run automatically before enrichment begins for each entity.

When using the REST API, include the classification_model field in your enrichment request with the model's composite key (e.g., anthropic::claude-haiku-4-5).
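As an illustration, a request body carrying this field could be assembled as below. Only the classification_model field and the example composite key come from the text above. The entity field and the build_enrichment_request helper are placeholders invented for this sketch, and the endpoint is omitted, so consult the API reference for the actual request shape.

```python
import json
from typing import Optional


def build_enrichment_request(entity: dict,
                             classification_model: Optional[str] = None) -> str:
    """Serialize an enrichment request, optionally enabling classification."""
    body = {"entity": entity}  # hypothetical field name for the entity data
    if classification_model is not None:
        # Composite model key, e.g. "anthropic::claude-haiku-4-5".
        body["classification_model"] = classification_model
    return json.dumps(body)


request_json = build_enrichment_request(
    {"name": "Titan"},
    classification_model="anthropic::claude-haiku-4-5",
)
```

Leaving classification_model out of the body corresponds to running enrichment without the pre-flight step.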