Benchmark scenarios let you compare LLM models on a real, repeatable enrichment task — apples to apples — capturing each model’s output and total cost so you can pick the right model for the job.
Models differ wildly in accuracy, structured-output reliability, and price. Rather than guessing, a benchmark scenario runs the same schema and entity through many models at once and records what each produced and what it cost. You compare on evidence, then lock in the cheapest model that meets your quality bar.
A benchmark scenario is a saved, reusable enrichment test: a schema, a fixed entity input (search keys or raw JSON), an enrichment strategy, languages, the response-schema / strict-structured-output toggles, and any attachments. It also holds its gold reference and how results are graded against it (an optional judge model, an embedding model, and a strictness threshold). Define it once and reuse it across every model you want to compare.
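As a rough mental model, the parts of a scenario listed above could be pictured as a single record. The sketch below is illustrative only: the class name, field names, and default values are assumptions made for this example, not the product's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class BenchmarkScenario:
    """Hypothetical shape of a saved scenario; all names are illustrative."""

    name: str
    schema: dict[str, Any]                 # the enrichment schema under test
    entity_input: dict[str, Any]           # fixed input: search keys or raw JSON
    strategy: str                          # enrichment strategy
    languages: list[str] = field(default_factory=lambda: ["en"])
    use_response_schema: bool = True       # response-schema toggle
    strict_structured_output: bool = False # strict-structured-output toggle
    attachments: list[str] = field(default_factory=list)
    # Gold reference and grading settings
    reference: Optional[dict[str, Any]] = None
    reference_verified: bool = False
    judge_model: Optional[str] = None      # optional
    embedding_model: Optional[str] = None
    strictness: float = 0.8                # assumed default, not a documented value

    def is_runnable(self) -> bool:
        # A scenario can only be benchmarked once its reference is verified.
        return self.reference is not None and self.reference_verified
```

Keeping the reference and grading settings on the same record as the input reflects the idea that a scenario is defined once and then reused unchanged for every model.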
Once the scenario has a verified reference, run it against one provider’s active models or every active model in view. Each model is enriched independently — no fusion — so you get a clean, side-by-side result per model. Progress streams live, and each successful result is automatically scored against the reference as the run finishes.
Every run is saved with its structured output, success status, token counts, processing time, and total billed cost. Expand any row to inspect the JSON output or jump to the underlying enrichment record.
Re-running a scenario on the same model overwrites its previous result, so the table always reflects the latest run. Edit a scenario’s config and older results are flagged stale until you re-run them. Set Runs per model to 2 or 3 and each model is benchmarked that many times: the table keeps the mean of cost, quality, and speed plus a consistency spread (models vary run‑to‑run), and the run costs roughly that multiple of the credits.
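To make the mean-plus-spread idea concrete, here is a minimal sketch of how repeated runs for one model might be folded into a single row. The record keys (`cost`, `quality`, `seconds`) are hypothetical, and the spread is computed as a population standard deviation, which is an assumption: the product may measure consistency differently.

```python
from statistics import mean, pstdev


def aggregate_runs(runs):
    """Collapse repeated runs of one model into mean and spread per metric.

    `runs` is a list of dicts with hypothetical keys cost, quality, seconds.
    """
    summary = {}
    for key in ("cost", "quality", "seconds"):
        values = [run[key] for run in runs]
        summary[key] = {
            "mean": mean(values),
            # Assumed spread measure: population standard deviation.
            "spread": pstdev(values),
        }
    return summary


runs = [
    {"cost": 2.0, "quality": 0.5, "seconds": 10.0},
    {"cost": 4.0, "quality": 0.5, "seconds": 14.0},
]
row = aggregate_runs(runs)
```

A spread of zero on a metric means the model behaved identically across runs on that metric; a large spread is a hint that a single run would have been a poor basis for comparison.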
The results table is built for comparison. A summary strip across the top calls out the success rate and the cheapest and fastest models that succeeded. Every column — model, status, strategy, cost, tokens, and time — is sortable, so one click ranks models by price or latency. Filter by model name, status, or strategy to narrow the view, and expand any row to read the full structured output or open the underlying enrichment record.
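The summary strip and the sortable columns amount to a few simple aggregations over the result rows. The following sketch shows one way to compute them from exported results; the row keys (`model`, `success`, `cost`, `seconds`) and function names are assumptions for illustration, not a documented export format.

```python
def summarize(results):
    """Compute success rate plus the cheapest and fastest successful models."""
    succeeded = [r for r in results if r["success"]]
    rate = len(succeeded) / len(results) if results else 0.0
    # Only successful runs are eligible for cheapest/fastest.
    cheapest = min(succeeded, key=lambda r: r["cost"], default=None)
    fastest = min(succeeded, key=lambda r: r["seconds"], default=None)
    return {
        "success_rate": rate,
        "cheapest": cheapest["model"] if cheapest else None,
        "fastest": fastest["model"] if fastest else None,
    }


def rank_by(results, column):
    """Return model names sorted ascending by one column, like a column sort."""
    return [r["model"] for r in sorted(results, key=lambda r: r[column])]
```

Restricting cheapest and fastest to successful runs matters: a model that fails quickly and cheaply should not be reported as the best value.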
Benchmarking is iterative. Tick rows with the checkboxes (shift-click for a range), then use the ··· menu to act on just that subset without re-running everything.
Every scenario holds a reference result — the expected output for its entity — and a scenario can only be benchmarked once that reference is verified. Until then it won’t appear in any run menu. The reference is the baseline for judging quality: how close each model gets, field by field, and (for lists like a movie’s cast) how many of the correct items it actually found. You set it — along with the judge model, embedding model, and strictness used to grade against it — right in the scenario editor.
Build it two ways. Generate it: attach a document that contains the correct values (a datasheet, an official page), turn on web search, and run a few strong models — they extract the answer from your source rather than from memory, so the result is grounded in truth, not guesswork. Or paste a known-good result you already have. Either way you review the JSON, correct anything, and mark it verified — an explicit sign-off that this is the gold answer.
Because the reference is grounded and human-checked once, it doubles as a trustworthy yardstick you reuse across every model and every future run.
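The field-by-field comparison and list coverage described above can be sketched in a few lines. This is a deliberately simplified stand-in: it uses exact matching, whereas the product grades with a judge model, an embedding model, and a strictness threshold. The function names and the 0.8 default threshold are assumptions for illustration.

```python
from statistics import mean


def score_against_reference(output, reference):
    """Score a model's output per field against the gold reference.

    Scalars: 1.0 on exact match, else 0.0 (a stand-in for semantic grading).
    Lists: fraction of reference items the model found (recall).
    Assumes list items are hashable, e.g. strings.
    """
    scores = {}
    for key, expected in reference.items():
        actual = output.get(key)
        if isinstance(expected, list):
            wanted = set(expected)
            found = set(actual) if isinstance(actual, list) else set()
            scores[key] = len(wanted & found) / len(wanted) if wanted else 1.0
        else:
            scores[key] = 1.0 if actual == expected else 0.0
    return scores


def passes(scores, strictness=0.8):
    """Assumed rule: the mean field score must reach the strictness threshold."""
    return mean(scores.values()) >= strictness


reference = {
    "title": "Alien",
    "year": 1979,
    "cast": ["Sigourney Weaver", "Tom Skerritt", "John Hurt", "Ian Holm"],
}
output = {"title": "Alien", "year": 1980, "cast": ["Sigourney Weaver", "John Hurt"]}
scores = score_against_reference(output, reference)
```

Here the model gets the title right, the year wrong, and finds half of the cast, so the per-field view shows exactly where it falls short rather than a single opaque number.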
Benchmarks live in Model Management → Benchmarks (available to organization owners and admins). Create and manage scenarios there, or launch a run from any of four places: the Benchmark models button in the toolbar (all active models in view), the Benchmark models action on any provider row (that provider’s active models), the Benchmark dropdown that appears when you select models in the Models panel (the selected models), or the Benchmark model action on any single model row.
Benchmark runs make real LLM calls and deduct credits based on actual usage, exactly like a normal enrichment. The confirmation dialog tells you how many models you’re about to run before any spend happens. Each saved result shows its billed cost, so a benchmark doubles as a cost-comparison tool.