AI Schema Generation - Entity Enricher Documentation

AI Schema Generation

Generate structured JSON schemas from sample data using AI, with automatic self-correction and intelligent post-processing.

How It Works

Schema generation turns raw entity data into a typed, annotated JSON schema that defines exactly what information to extract during enrichment. Instead of manually writing schemas, you paste sample JSON and let AI analyze the structure, infer types, assign expertise domains, and suggest improvements.

The Generation Pipeline

Input preprocessing — Your sample JSON is analyzed. Localized objects (like {"en": "...", "fr": "..."}) are collapsed to a single value, and the property count determines how many expertise domains are allowed.
Prompt construction — An adaptive system prompt is built based on your data's complexity: whether it has nested objects, how many properties it contains, and whether multilingual fields were detected.
LLM generation with self-correction — The AI generates the schema. If any of the 8 validation rules fail, errors are sent back to the AI for correction — up to 6 total attempts.
Post-processing — Deterministic rules refine the schema: marking nullable fields, clearing empty search keys, and collecting expertise metadata.
Auto-save — The generated schema is automatically saved and deduplicated using content hashing, so identical schemas are not duplicated.
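
The first and last steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Entity Enricher's actual code: the function names, the language-tag heuristic used to detect localized objects, the way properties are counted, and the choice of SHA-256 over a canonical JSON string are all hypothetical.

```python
import hashlib
import json
import re

# Assumption: a "localized object" is a dict whose keys all look like
# language tags ("en", "fr", "pt-BR") and whose values are all strings.
LANG_TAG = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


def is_localized(value):
    return (
        isinstance(value, dict)
        and len(value) > 0
        and all(isinstance(k, str) and LANG_TAG.match(k) for k in value)
        and all(isinstance(v, str) for v in value.values())
    )


def collapse_localized(value):
    """Replace each localized object with its first translation, recursively."""
    if is_localized(value):
        return next(iter(value.values()))
    if isinstance(value, dict):
        return {k: collapse_localized(v) for k, v in value.items()}
    if isinstance(value, list):
        return [collapse_localized(v) for v in value]
    return value


def count_properties(value):
    """Count object keys at every nesting level (one possible definition)."""
    if isinstance(value, dict):
        return len(value) + sum(count_properties(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_properties(v) for v in value)
    return 0


def content_hash(schema):
    """Hash a canonical serialization so key order does not change the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


sample = {
    "name": {"en": "Acme Corp", "fr": "Acme SARL"},
    "revenue": 1200000,
    "address": {"city": "Lyon", "country": "FR"},
}
collapsed = collapse_localized(sample)
print(collapsed["name"])            # Acme Corp
print(count_properties(collapsed))  # 5
```

Because the hash is computed over a key-sorted serialization, two schemas that differ only in key order produce the same digest, which is one way identical schemas could be recognized before saving.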

Self-Correction Loop

The self-correction loop is what makes schema generation reliable. After the AI produces a schema, it passes through a validator that checks 8 rules covering type correctness, expertise assignment, reference integrity, and data completeness. If any rule fails, the specific error message is sent back to the AI so it can fix the issue in its next attempt.

Example Self-Correction

Attempt 1: AI generates schema. Validator detects "revenue: type mismatch — input is number but schema says 'string'".

Retry: Error is sent back to the AI with context about what went wrong.

Attempt 2: AI corrects the type to number. All 8 rules pass. Schema is accepted.

This approach is far more reliable than asking the AI to “be careful about types” in the prompt. The validator catches concrete errors and gives the AI precise feedback to fix them. Learn more about each rule in the Validation Rules guide.
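
The loop can be sketched as follows. This is an illustrative outline only: it implements a single type-check rule rather than all 8, the function names are hypothetical, the `generate` callable stands in for the real LLM call, and raising an exception once the attempts are exhausted is an assumption, since the behavior in that case is not described here.

```python
MAX_ATTEMPTS = 6  # the documented cap on total attempts

# Map Python types (as parsed from JSON) to JSON Schema type names.
JSON_TYPES = {bool: "boolean", int: "integer", float: "number",
              str: "string", list: "array", dict: "object"}


def validate_types(sample, schema):
    """Return one error message per property whose declared type contradicts the sample."""
    errors = []
    for name, value in sample.items():
        if value is None:
            continue  # null values carry no type information
        declared = schema.get("properties", {}).get(name, {}).get("type")
        actual = JSON_TYPES[type(value)]
        # An integer sample value is also acceptable for a "number" field.
        ok = declared == actual or (actual == "integer" and declared == "number")
        if not ok:
            errors.append(
                f"{name}: type mismatch, input is {actual} but schema says '{declared}'"
            )
    return errors


def generate_with_self_correction(sample, generate):
    """Call generate(sample, feedback) until validation passes or attempts run out."""
    feedback = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        schema = generate(sample, feedback)
        feedback = validate_types(sample, schema)
        if not feedback:
            return schema, attempt
    raise RuntimeError(f"schema still invalid after {MAX_ATTEMPTS} attempts: {feedback}")


def fake_model(sample, feedback):
    # Stand-in for the LLM: wrong on the first call, corrected once it sees feedback.
    revenue_type = "number" if feedback else "string"
    return {"type": "object", "properties": {"revenue": {"type": revenue_type}}}


schema, attempts = generate_with_self_correction({"revenue": 1.2e9}, fake_model)
print(attempts, schema["properties"]["revenue"]["type"])  # 2 number
```

The key design point is that the validator's concrete messages, not a generic instruction, become the input to the next attempt.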

What the Schema Contains

A generated schema is more than a simple type definition. Each property includes metadata that guides the enrichment process:

Type

JSON Schema type (string, number, integer, boolean, array, object)

Description

Contextual description that tells the AI what information to find

Expertise

Which expert domain (financial, regulatory, etc.) provides this value

Search Key

Whether this field identifies the entity (search) or deduplicates arrays (merge)

Nullable

Whether the field can be null, preventing unnecessary retries for optional data

Multilingual

Whether the field should be enriched across multiple languages

Preserve

Whether to keep the original value unchanged during enrichment

Examples

Realistic example values that guide the AI toward the right format
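
To make the list above concrete, here is how a single property carrying all eight kinds of metadata might look, written as a Python dictionary. Apart from the standard JSON Schema keywords "type", "description", and "examples", every key name and value below is a hypothetical stand-in; the actual keys used by Entity Enricher may differ.

```python
# JSON Schema types listed in the Type entry above.
ALLOWED_TYPES = {"string", "number", "integer", "boolean", "array", "object"}

# Hypothetical annotated property for a company's legal name.
name_property = {
    "type": "string",
    "description": "Registered legal name of the company",
    "expertise": "Corporate Information",  # hypothetical key name
    "search_key": "search",                # hypothetical: "search" or "merge"
    "nullable": False,                     # hypothetical key name
    "multilingual": False,                 # hypothetical key name
    "preserve": True,                      # hypothetical key name
    "examples": ["Acme Corporation", "Globex GmbH"],
}

print(name_property["type"] in ALLOWED_TYPES)
```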

Expertise Domain Detection

The AI groups schema properties into expertise domains based on their semantic meaning. For example, a pharmaceutical company schema might have domains like “Financial Analyst,” “Regulatory Expert,” and “Corporate Information.” These domains are used by the multi-expertise strategy to run parallel, specialized LLM calls for deeper results.

Domain Count Limits

The number of expertise domains is automatically limited based on your data's property count to prevent over-fragmentation:

5 properties

1 domain

12 properties

2 domains

30 properties

5 domains

60 properties

10 domains
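
The four data points above are consistent with allowing roughly one domain per six properties, with a minimum of one. The function below is only a guess that reproduces those four points; the actual rule is not stated here and may differ for other property counts.

```python
def max_domains(property_count):
    """Hypothetical limit: one domain per six properties, never fewer than one."""
    return max(1, property_count // 6)


for count in (5, 12, 30, 60):
    print(count, max_domains(count))
```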

Post-Processing

After the AI generates a valid schema, three deterministic post-processing steps refine it based on your actual input data:

Nullable detection

Fields with null values in your input are automatically marked as nullable, so the AI won't waste retries trying to fill them.

Empty search key clearing

Search key flags are removed from fields with empty values (null, empty string, zero) since they can't help identify the entity.

Expertise collection

All unique expertise domains are gathered from the schema for metrics and strategy configuration.
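
A minimal sketch of the three steps, assuming a flat schema and the hypothetical metadata keys "nullable", "search_key", and "expertise" (the real key names and the handling of nested objects are not specified here):

```python
def is_empty(value):
    # Empty per the description above: null, empty string, or zero.
    # bool is excluded so that False is not mistaken for the number 0.
    if isinstance(value, bool):
        return False
    return value is None or value == "" or value == 0


def post_process(schema, sample):
    """Refine the schema in place and return the sorted list of expertise domains."""
    props = schema["properties"]
    for name, prop in props.items():
        if name not in sample:
            continue  # nothing in the input to compare against
        value = sample[name]
        if value is None:
            prop["nullable"] = True          # nullable detection
        if is_empty(value):
            prop.pop("search_key", None)     # empty search key clearing
    # Expertise collection: unique domains across all properties.
    return sorted({p["expertise"] for p in props.values() if p.get("expertise")})


schema = {
    "properties": {
        "name": {"type": "string", "search_key": "search",
                 "expertise": "Corporate Information"},
        "website": {"type": "string", "search_key": "search",
                    "expertise": "Corporate Information"},
        "revenue": {"type": "number", "expertise": "Financial Analyst"},
    }
}
sample = {"name": "Acme Corp", "website": None, "revenue": 1.2e9}
domains = post_process(schema, sample)
print(domains)
```

All three steps depend only on the schema and the input data, so running them twice on the same inputs yields the same result.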

Non-Determinism Check

Some property names don't pin down a single reproducible answer. For example, annual_revenue leaves the year and currency unspecified, so it can yield different values from run to run. Generation addresses this in two places: the prompt itself requires deterministic names and descriptions, and after the schema is saved an analyzer annotates any property that still risks varying between models or runs. During sample generation, the same analyzer also feeds corrections back to the AI for up to two extra rounds.

Flagged properties show a “varies” badge in the Schema Editor with a suggested fix. See the Non-Determinism Check guide for the four causes and their remedies.

Grounding Samples with Web Search

When generating a sample entity from a description, you can enable “Use web search” to let the model look up current facts on the web instead of relying on its training data alone. This can produce fresher, more accurate sample values, especially for fast-moving facts like prices, staff counts, or recent releases. The option only appears for models whose provider supports built-in web search, and search calls are billed by the provider like any other model usage.

AI Schema Editing

After generation, you can modify schemas using natural language instructions. Type a command and the AI applies the change while preserving your existing schema structure. Each edit also produces 5 suggestions for further improvements.

Example Edit Commands

→ Add an employee_count integer field

→ Create a nested address object with city and country

→ Add French descriptions to all text fields

→ Define a parent company reference using $defs

→ Mark the website field as nullable

AI edits are validated using a subset of the generation rules (type checking, reference integrity, expertise consistency) without comparing against input data, since you may intentionally add or remove fields.
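
As an illustration of what a reference integrity check can involve, the sketch below walks a schema and reports every "$ref" that does not resolve to an entry under "$defs". The function name is hypothetical, and treating every reference outside "#/$defs/" as unresolved is a simplifying assumption; the actual rule may be broader or stricter.

```python
def unresolved_refs(schema):
    """Return every "$ref" that does not point at an existing "$defs" entry."""
    defs = schema.get("$defs", {})
    prefix = "#/$defs/"
    missing = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str):
                if not ref.startswith(prefix) or ref[len(prefix):] not in defs:
                    missing.append(ref)
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(schema)
    return missing


edited = {
    "type": "object",
    "properties": {
        "parent_company": {"$ref": "#/$defs/company"},
        "subsidiaries": {"type": "array", "items": {"$ref": "#/$defs/subsidiary"}},
    },
    "$defs": {"company": {"type": "object"}},
}
print(unresolved_refs(edited))  # ['#/$defs/subsidiary']
```

A check like this needs only the schema itself, which fits the point above that edits are validated without comparing against input data.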

AI Suggestions

Both schema generation and AI editing produce 5 targeted suggestions covering different improvement categories:

Data completeness: Missing fields that could enrich your entity

Data quality: Validation patterns, format constraints

Relationships: Nested structures, entity references via $defs

Internationalization: Multilingual translations, locale support

Business context: Domain-specific fields and expertise groupings

Suggestions appear as clickable chips in the Schema Editor — click one to auto-fill the AI edit input and apply it.