Cost Optimization & Prompt Caching - Entity Enricher Documentation

Cost Optimization

With LLM enrichment, the bill is the tokens. Entity Enricher is built to send as few billed tokens as possible without sacrificing accuracy — led by prompt caching, and backed by schema scoping, smart gating, and fewer wasted retries. Most of it happens automatically; nothing here requires extra configuration.

Where the Cost Goes

Every enrichment call pays for input tokens (your prompt, schema, and any attached documents), output tokens (the structured result), and — if enabled — web-search queries. The largest, most repetitive part is usually the input: the same system instructions, schema description, and source documents get re-sent on every call. Caching that shared input is the single biggest lever, so it comes first.

Input tokens

Prompt + schema + attachments. Large and highly repetitive across calls — the prime target for caching and scoping.

Output tokens

The structured result. Kept lean by asking each model only for the fields it actually owns.

Wasted spend

Failed retries, rate-limit thrashing, and enriching the wrong entity. Eliminated up front rather than paid for.

Prompt Caching

When a multi-expertise enrichment runs, it makes several LLM calls for the same entity — one per expertise domain. Every one of those calls shares the same opening context: the generic system instructions and any inline-text documents you attached. Entity Enricher keeps that shared prefix byte-for-byte identical across calls and marks it as cacheable, so the provider stores it once and re-reads it on every subsequent call at roughly a tenth of the normal input price.

How a cache hit changes the bill
Without caching

Each of the N calls re-sends the full shared context at full input price. Five expertises means paying for that big shared block five times.

With caching

The shared block is written to cache once, then read back on the other four calls at ~10% of input price. The savings grow with every extra expertise, language, and attached document.

Cache warm-up

Provider caches are only readable after the first request that writes them finishes. If all the expertise calls fired at once, none would find a warm cache and each would redundantly write its own copy. So when caching applies, the first call runs on its own, a brief moment is allowed for the cache to propagate, and only then are the remaining calls launched in parallel — so each one reads the warm cache instead of paying to rewrite it.

Works across providers and attachments

Anthropic models cache the shared instructions explicitly; attached PDFs and images are cached in place; and providers with automatic prefix caching (OpenAI, xAI, DeepSeek and others) benefit from the same byte-identical prefix. Caching pays off most exactly when input is large — many expertises, multiple languages, or attached documents.

You only pay for what isn't cached

Cost accounting is cache-aware: cached input tokens are billed at the model's cache-read rate (a fraction of the input rate), and only the genuinely new tokens are billed at full price. The savings show up directly in your cost analytics, not just in theory.

Smaller Payloads per Call

Beyond caching the shared prefix, Entity Enricher shrinks the part of each call that isn't shared.

Per-expertise schema subsetting

Each expertise call only receives the slice of the schema it is responsible for, not the entire schema.

A financial expert never sees the regulatory fields. Fewer fields means fewer tokens in and out — and the response is pruned back to its slice before merging.

Schema-less text channel

When documents are attached and you have not opted into a strict structured-output mode, the field list lives only in the readable prompt — no schema is duplicated on the wire.

This drops the schema tokens entirely and keeps the shared prefix identical (so it caches better). The reply is still validated client-side, with automatic self-correction on drift.

Don't Pay to Enrich the Wrong Thing

Optional pre-flight classification runs a single, cheap, fast model to check whether an entity actually matches your schema before any expensive multi-model enrichment begins. A mismatch — like a moon sent to a “Planet” schema — is caught for a fraction of a cent instead of burning a full enrichment across several premium models.

It is non-blocking (if the check fails, enrichment proceeds anyway) and cancellable, so you never start paying for models you decided to skip.

Fewer Wasted Retries

A failed validation round is a full-price LLM call with nothing to show for it. Two mechanisms keep retries rare and productive.

Output normalization

Common LLM output quirks — index-keyed objects that should be arrays, the string 'null', stray escaped quotes — are corrected before validation runs.

Many would-be validation failures are fixed silently, so they never trigger a paid retry at all.

Targeted self-correction

When a retry is genuinely needed, the exact validation error is fed back to the model so it can fix that specific problem.

Clear, specific feedback raises the odds the next attempt succeeds, instead of burning attempts on vague guidance.

Right Strategy, Controlled Concurrency

Pick the strategy that fits the schema

Single-pass is cheapest for small schemas; multi-expertise is built for large ones, where caching plus per-expertise scoping more than pay for the extra calls. See Strategies for when to use each.

Rate limiting avoids costly thrashing

A per-provider concurrency limit keeps jobs from hammering a provider into rate-limit errors, which would otherwise trigger backoff and retries — wasted tokens and wall-clock time. Throttled, steady concurrency is cheaper than fighting 429s.

Full Cost Visibility

Every enrichment records its real token counts — including cached reads — and the resulting cost. The Cost Dashboard turns that into time-series charts and per-model breakdowns, so you can see exactly where spend goes and confirm that caching and scoping are doing their job. Pricing you see is the price you are billed; raw provider costs and any platform markup are kept transparent.