AI Schema Generation - Entity Enricher Documentation

AI Schema Generation

Generate structured JSON schemas from sample data using AI, with automatic self-correction and intelligent post-processing.

How It Works

Schema generation turns raw entity data into a typed, annotated JSON schema that defines exactly what information to extract during enrichment. Instead of manually writing schemas, you paste sample JSON and let AI analyze the structure, infer types, assign expertise domains, and suggest improvements.

The Generation Pipeline

  1. Input preprocessing — Your sample JSON is analyzed. Localized objects (like {"en": "...", "fr": "..."}) are collapsed to a single value, and the property count determines how many expertise domains are allowed.
  2. Prompt construction — An adaptive system prompt is built based on your data's complexity: whether it has nested objects, how many properties it contains, and whether multilingual fields were detected.
  3. LLM generation with self-correction — The AI generates the schema. If any of the 8 validation rules fail, errors are sent back to the AI for correction — up to 6 total attempts.
  4. Post-processing — Deterministic rules refine the schema: marking nullable fields, clearing empty search keys, and collecting expertise metadata.
  5. Auto-save — The generated schema is automatically saved and deduplicated using content hashing, so identical schemas are not duplicated.
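As a rough illustration of the input preprocessing step, the sketch below collapses localized objects and counts properties. The detection heuristic (every key looks like a language code and every value is a string), the preference for "en", and the way nested keys are counted are all assumptions made for this example. The names `is_localized`, `collapse_localized`, and `count_properties` are hypothetical.

```python
import re

# Assumption: a "localized object" is a non-empty dict whose keys all look like
# language codes ("en", "fr", "pt-BR") and whose values are all strings.
LANG_CODE = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")

def is_localized(value):
    return (
        isinstance(value, dict)
        and len(value) > 0
        and all(isinstance(k, str) and LANG_CODE.match(k) for k in value)
        and all(isinstance(v, str) for v in value.values())
    )

def collapse_localized(value, preferred="en"):
    """Recursively replace localized objects with a single representative string."""
    if is_localized(value):
        # Prefer the chosen language, otherwise fall back to the first entry.
        return value.get(preferred, next(iter(value.values())))
    if isinstance(value, dict):
        return {k: collapse_localized(v, preferred) for k, v in value.items()}
    if isinstance(value, list):
        return [collapse_localized(v, preferred) for v in value]
    return value

def count_properties(value):
    """Count object keys at every nesting level (an assumed counting rule)."""
    if isinstance(value, dict):
        return len(value) + sum(count_properties(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_properties(v) for v in value)
    return 0

sample = {
    "name": {"en": "Acme Pharma", "fr": "Acme Pharma SA"},
    "revenue": 1250000.5,
    "address": {"city": "Lyon", "country": "FR"},
}
collapsed = collapse_localized(sample)
print(collapsed["name"], count_properties(collapsed))
```

A key-based heuristic like this one would misread an object such as {"id": "..."} as localized, so a real implementation likely needs a stricter test.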

Self-Correction Loop

The self-correction loop is what makes schema generation reliable. After the AI produces a schema, it passes through a validator that checks 8 rules covering type correctness, expertise assignment, reference integrity, and data completeness. If any rule fails, the specific error message is sent back to the AI so it can fix the issue in its next attempt.

Example Self-Correction

Attempt 1: AI generates schema. Validator detects: revenue: type mismatch — input is number but schema says 'string'
Retry: Error is sent back to the AI with context about what went wrong.
Attempt 2: AI corrects the type to number. All 8 rules pass. Schema is accepted.

This approach is more reliable than asking the AI to “be careful about types” in the prompt, because the validator catches concrete errors and gives the AI precise feedback to fix them. Learn more about each rule in the Validation Rules guide.
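The loop described above can be sketched as follows. This is a minimal illustration, not the product's implementation: only a type-mismatch check stands in for the 8 rules, the error message format is simplified, and `validate`, `generate_with_correction`, and `fake_model` are hypothetical names. The cap of 6 total attempts comes from the pipeline description.

```python
MAX_ATTEMPTS = 6  # documented cap on total attempts

def validate(schema, sample):
    """Hypothetical validator: only a type-mismatch rule is sketched here."""
    json_types = {str: "string", bool: "boolean", int: "integer", float: "number"}
    errors = []
    for name, value in sample.items():
        expected = json_types.get(type(value))
        declared = schema.get("properties", {}).get(name, {}).get("type")
        # An integer sample value is also acceptable under "number".
        ok = declared == expected or (expected == "integer" and declared == "number")
        if expected and not ok:
            errors.append(
                f"{name}: type mismatch, input is {expected} but schema says '{declared}'"
            )
    return errors

def generate_with_correction(generate, sample):
    """Call generate(sample, feedback) until validation passes or attempts run out."""
    feedback = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        schema = generate(sample, feedback)
        feedback = validate(schema, sample)
        if not feedback:
            return schema, attempt
    raise RuntimeError(f"schema still invalid after {MAX_ATTEMPTS} attempts: {feedback}")

def fake_model(sample, feedback):
    # Stand-in for the LLM call: wrong on the first try, corrected once it sees feedback.
    revenue_type = "number" if feedback else "string"
    return {"type": "object", "properties": {"revenue": {"type": revenue_type}}}

schema, attempts = generate_with_correction(fake_model, {"revenue": 1250000.5})
print(attempts, schema["properties"]["revenue"]["type"])
```

Passing the previous attempt's error list into the next call is what makes the feedback specific: the model sees the exact field and the expected type rather than a general instruction.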

What the Schema Contains

A generated schema is more than a simple type definition. Each property includes metadata that guides the enrichment process:

Type

JSON Schema type (string, number, integer, boolean, array, object)

Description

Contextual description that tells the AI what information to find

Expertise

Which expert domain (financial, regulatory, etc.) provides this value

Search Key

Whether this field identifies the entity (search) or deduplicates arrays (merge)

Nullable

Whether the field can be null, preventing unnecessary retries for optional data

Multilingual

Whether the field should be enriched across multiple languages

Preserve

Whether to keep the original value unchanged during enrichment

Examples

Realistic example values that guide the AI toward the right format
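To make the metadata concrete, here is a sketch of what two annotated properties might look like. "type", "description", and "examples" are standard JSON Schema keywords. Every other key name ("expertise", "search_key", "nullable", "multilingual", "preserve") is a hypothetical spelling chosen for this example, since the exact keys are not given here.

```python
import json

# Key names other than "type", "description", and "examples" are hypothetical:
# the documentation names the concepts but not the exact keys.
company_schema = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Official registered name of the company",
            "expertise": "Corporate Information",
            "search_key": "search",   # identifies the entity
            "preserve": True,         # keep the original value during enrichment
            "examples": ["Acme Pharma"],
        },
        "revenue": {
            "type": "number",
            "description": "Annual revenue in USD for the latest fiscal year",
            "expertise": "Financial Analyst",
            "nullable": True,         # a missing value should not trigger retries
            "multilingual": False,
            "examples": [1250000.5],
        },
    },
}

print(json.dumps(company_schema, indent=2))
```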

Expertise Domain Detection

The AI groups schema properties into expertise domains based on their semantic meaning. For example, a pharmaceutical company schema might have domains like “Financial Analyst,” “Regulatory Expert,” and “Corporate Information.” These domains are used by the multi-expertise strategy to run parallel, specialized LLM calls for deeper results.

Domain Count Limits

The number of expertise domains is automatically limited based on your data's property count to prevent over-fragmentation:

5 properties: 1 domain
12 properties: 2 domains
30 properties: 5 domains
60 properties: 10 domains
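The exact limiting rule is not stated, but one simple rule that reproduces all four data points above is one domain per six properties, with a minimum of one. The function below is only that guess written out; `max_expertise_domains` is a hypothetical name, and the real rule may round differently or apply an upper cap.

```python
def max_expertise_domains(property_count: int) -> int:
    """Assumed rule: one domain per six properties, never fewer than one."""
    return max(1, property_count // 6)

for count in (5, 12, 30, 60):
    print(count, "properties ->", max_expertise_domains(count), "domain(s)")
```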

Post-Processing

After the AI generates a valid schema, three deterministic post-processing steps refine it based on your actual input data:

Nullable detection

Fields with null values in your input are automatically marked as nullable, so the AI won't waste retries trying to fill them.

Empty search key clearing

Search key flags are removed from fields with empty values (null, empty string, zero) since they can't help identify the entity.

Expertise collection

All unique expertise domains are gathered from the schema for metrics and strategy configuration.
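The three steps can be sketched for a flat schema as shown below. The key names "nullable", "search_key", and "expertise" are hypothetical, as are the function names. Treating the boolean False as non-empty is an assumption of this sketch.

```python
def is_empty(value):
    # Empty per the description above: null, empty string, or zero.
    # Assumption: the boolean False is not treated as zero.
    return value is None or value == "" or (
        isinstance(value, (int, float)) and not isinstance(value, bool) and value == 0
    )

def post_process(schema, sample):
    """Apply the three deterministic steps in place; return the sorted domain list."""
    domains = set()
    for name, prop in schema["properties"].items():
        value = sample.get(name)
        if name in sample and value is None:
            prop["nullable"] = True          # 1. nullable detection
        if name in sample and is_empty(value):
            prop.pop("search_key", None)     # 2. empty search key clearing
        if prop.get("expertise"):
            domains.add(prop["expertise"])   # 3. expertise collection
    return sorted(domains)

schema = {"properties": {
    "name": {"type": "string", "search_key": "search", "expertise": "Corporate Information"},
    "ticker": {"type": "string", "search_key": "search", "expertise": "Financial Analyst"},
    "website": {"type": "string", "expertise": "Corporate Information"},
}}
sample = {"name": "Acme Pharma", "ticker": "", "website": None}
domains = post_process(schema, sample)
print(domains)
```

Because these steps depend only on the sample data and the schema, they give the same result on every run, unlike the LLM generation step that precedes them.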

AI Schema Editing

After generation, you can modify schemas using natural language instructions. Type a command and the AI applies the change while preserving your existing schema structure. Each edit also produces 5 suggestions for further improvements.

Example Edit Commands

Add an employee_count integer field
Create a nested address object with city and country
Add French descriptions to all text fields
Define a parent company reference using $defs
Mark the website field as nullable

AI edits are validated using a subset of the generation rules (type checking, reference integrity, expertise consistency) without comparing against input data, since you may intentionally add or remove fields.
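As one example of a check that needs no input data, reference integrity can be verified from the schema alone. The sketch below flags any local "$ref" that does not resolve to an entry under "$defs". It handles only flat references of the form "#/$defs/Name", and `find_dangling_refs` is a hypothetical name rather than the product's validator.

```python
def find_dangling_refs(schema):
    """Return every local "$ref" that does not resolve to an entry under "$defs"."""
    prefix = "#/$defs/"
    defs = schema.get("$defs", {})
    dangling = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith(prefix):
                if ref[len(prefix):] not in defs:
                    dangling.append(ref)
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(schema)
    return dangling

edited = {
    "type": "object",
    "properties": {
        "parent_company": {"$ref": "#/$defs/Company"},
        "auditor": {"$ref": "#/$defs/Auditor"},  # no matching definition
    },
    "$defs": {"Company": {"type": "object", "properties": {"name": {"type": "string"}}}},
}
print(find_dangling_refs(edited))
```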

AI Suggestions

Both schema generation and AI editing produce 5 targeted suggestions covering different improvement categories:

Data completeness: Missing fields that could enrich your entity
Data quality: Validation patterns, format constraints
Relationships: Nested structures, entity references via $defs
Internationalization: Multilingual translations, locale support
Business context: Domain-specific fields and expertise groupings

Suggestions appear as clickable chips in the Schema Editor — click one to auto-fill the AI edit input and apply it.
