Benchmark Scoring - Entity Enricher Documentation

Benchmark Scoring

Scoring turns a benchmark from “eyeball the JSON” into an objective number. Each model’s result is graded against a gold reference — the expected output — producing completeness, correctness, and an overall quality score you can sort on.

The gold reference

Scoring needs something to score against. Each scenario carries a reference output: the correct answer for its one fixed entity. Build it by generating with strong models (web search + a source-of-truth document) or by pasting a known-good result, then edit it by hand and mark it verified once you trust it. A verified reference is required to benchmark the scenario at all, so there’s always something to grade against. If you later edit the reference, or change the scenario’s scoring config, existing scores are flagged stale until you re-score.

How values are compared

The core problem: two correct answers can be written differently. A model that names an actor “R. Downey Jr.” instead of “Robert Downey Jr.” isn’t wrong. So each field is compared with a tiered ladder — cheapest and most certain first, escalating only when needed:

Exact & normalized

Identical values match. So do values that differ only in case, surrounding whitespace, or numeric precision ("Acme" = "ACME", 4.0 = 4). Free and fully deterministic.
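A minimal sketch of this first rung, assuming values arrive as plain Python strings, numbers, or booleans. The function name `normalized_equal` is hypothetical and is not part of the product; the real normalization rules may cover more cases.

```python
def normalized_equal(candidate, reference):
    """Return True if two values match exactly or after light normalization.

    Hypothetical helper for illustration only.
    """
    # Booleans are compared strictly, so True is never treated as the number 1.
    if isinstance(candidate, bool) or isinstance(reference, bool):
        return candidate is reference
    # Numbers are compared by value, so 4.0 and 4 match.
    if isinstance(candidate, (int, float)) and isinstance(reference, (int, float)):
        return float(candidate) == float(reference)
    # Strings ignore case and surrounding whitespace.
    if isinstance(candidate, str) and isinstance(reference, str):
        return candidate.strip().casefold() == reference.strip().casefold()
    # Anything else must be identical.
    return candidate == reference
```

For example, `normalized_equal("Acme", " ACME ")` and `normalized_equal(4.0, 4)` both match, while `normalized_equal("Acme", "Acme Inc")` does not and would move on to the next rung.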

Embedding similarity

For text, the candidate and reference are embedded and compared by cosine similarity. Above the threshold they count as the same — so a valid alternate spelling like "R. Downey Jr." vs "Robert Downey Jr." is a match, not an error. Dates are the exception: they are compared as calendar values, never by similarity, so a near-but-wrong date ("1972-03-14" vs "1972-03-24") is a clean mismatch rather than a deceptively high cosine. Booleans are likewise exact-or-nothing.
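The sketch below illustrates the two behaviours described above: a cosine comparison against a threshold for text, and a calendar comparison for dates. It assumes the embedding vectors have already been produced elsewhere and that dates are ISO-formatted strings. The function names and the default threshold of 0.85 are illustrative assumptions, not the product's actual values.

```python
import math
from datetime import date


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def same_text(candidate_vec, reference_vec, threshold=0.85):
    """Treat two texts as equivalent when their embeddings are similar enough.

    The threshold default is an assumed placeholder.
    """
    return cosine_similarity(candidate_vec, reference_vec) >= threshold


def same_date(candidate, reference):
    """Dates are compared as calendar values, never by similarity."""
    return date.fromisoformat(candidate) == date.fromisoformat(reference)
```

With this split, `same_date("1972-03-14", "1972-03-24")` is a plain mismatch even though the two strings differ by a single character.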

LLM judge

Values that similarity alone cannot settle (all free-text fields like summaries and descriptions, and every non-identical number) are sent to a judge model, which grades from 0 to 100 how well the answer captures the reference’s meaning. It rewards a correct answer worded differently or more briefly, and gives a number partial credit when the field tolerates it (a molecular weight of 273.37 vs 273.35, a half-life of 12 vs 15) while still failing it where exactness matters (a release year of 2020 vs 2023). Without a judge, free text falls back to a continuous similarity score, and a non-identical number is simply a mismatch.

A strictness setting controls the embedding threshold: higher means two differently-written values must be more similar to count as the same. The strictness, the optional judge model, and the embedding model are all set on the scenario — not chosen each time you score — so every model is graded identically and scores stay comparable.
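To make the escalation concrete, here is a simplified sketch of the ladder for a single short text field. It assumes `similarity` and `judge` are callables supplied by the caller (stand-ins for the embedding model and the judge model) and that `threshold` plays the role of the strictness setting. All names and the default threshold are hypothetical, and the real routing between rungs is likely more nuanced than this.

```python
def compare_text(candidate, reference, similarity=None, judge=None, threshold=0.85):
    """Return (score between 0 and 1, name of the rung that decided it).

    similarity(candidate, reference) -> float in 0..1   (assumed interface)
    judge(candidate, reference)      -> int in 0..100   (assumed interface)
    """
    # Rung 1: exact or normalized match, free and deterministic.
    if candidate.strip().casefold() == reference.strip().casefold():
        return 1.0, "exact"
    # Rung 2: embedding similarity above the strictness threshold.
    sim = similarity(candidate, reference) if similarity else None
    if sim is not None and sim >= threshold:
        return 1.0, "embedding"
    # Rung 3: the judge grades 0-100, rescaled here to 0-1.
    if judge is not None:
        return judge(candidate, reference) / 100.0, "judge"
    # No judge: fall back to the continuous similarity score if there is one.
    if sim is not None:
        return sim, "similarity"
    return 0.0, "mismatch"
```

Because the threshold and both callables are fixed per scenario rather than per scoring pass, every model's output goes through the same rungs in the same order.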

Scoring arrays (lists of items)

Lists — a film’s cast, a drug’s side effects — are where models differ most: a small model might find 4 actors where a strong one finds 15. Order doesn’t matter, and finding more correct items should win. So arrays are scored as a set, not position by position:

Each candidate item is matched to a reference item with the same ladder as fields, cheapest first: by its key field, then by identical text, then by embedding similarity, and finally — for the paraphrased remainder — by a single LLM set-alignment call that lines up the leftover items in one shot (only when the scenario has a judge).
Recall rewards coverage — finding 15 of 15 beats 4 of 15.
Precision punishes invented items — a hallucinated extra actor lowers the score.
F1 combines the two, and each matched pair is scored field-by-field, so “right actor, wrong role” still counts against you.
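The set-based scoring above can be sketched as follows. This simplified version matches items only by a normalized key field and leaves out the later rungs (identical text, embedding similarity, LLM set alignment) as well as the field-by-field scoring of each matched pair. The function name `score_array` and the default key `"name"` are assumptions for illustration.

```python
def score_array(candidate_items, reference_items, key="name"):
    """Score a list as a set: order is ignored, only membership counts."""

    def norm(item):
        return str(item[key]).strip().casefold()

    cand_keys = {norm(item) for item in candidate_items}
    ref_keys = {norm(item) for item in reference_items}
    matched = cand_keys & ref_keys

    # Precision: share of candidate items that the reference supports.
    precision = len(matched) / len(cand_keys) if cand_keys else 0.0
    # Recall: share of reference items that the candidate found.
    recall = len(matched) / len(ref_keys) if ref_keys else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missed": sorted(ref_keys - cand_keys),
        "hallucinated": sorted(cand_keys - ref_keys),
    }
```

A candidate that returns one correct actor and one invented actor against a three-actor reference gets precision 0.5, recall 1/3, and F1 0.4, with the two unfound actors listed as missed and the invented one as hallucinated.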

Expand a result row to see exactly which items were matched, missed, or hallucinated.

Reading the score

A single number hides too much, so every result carries sub-scores:

Completeness — did the model fill what the reference filled? (missing data hurts this)
Correctness — of what it did fill, how much is right?
Hallucination — how much did it invent that the reference doesn’t support?
Overall — a weighted blend, with identifier (key) fields weighted more heavily.
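As a rough illustration of how such sub-scores could be derived from per-field results, consider the sketch below. The exact formulas and weights are not given here, so everything in it is an assumption: the input shape, the function name `sub_scores`, the key-field weight of 2.0, and the choice to count an unfilled field as zero in the overall blend. Hallucination is left out because it depends on values the reference does not contain.

```python
def sub_scores(fields, key_weight=2.0):
    """Compute illustrative sub-scores from per-field results.

    fields: list of dicts with
      'filled' (bool)  - did the model provide a value the reference has?
      'score'  (float) - 0..1 correctness of that value
      'is_key' (bool)  - is this an identifier field?
    The weighting scheme is an assumption, not the documented formula.
    """
    filled = [f for f in fields if f["filled"]]
    # Completeness: how many reference-filled fields did the model fill?
    completeness = len(filled) / len(fields) if fields else 0.0
    # Correctness: average score over the fields the model did fill.
    correctness = sum(f["score"] for f in filled) / len(filled) if filled else 0.0
    # Overall: weighted mean where key fields count more and gaps count as 0.
    weights = [key_weight if f["is_key"] else 1.0 for f in fields]
    total = sum(weights)
    earned = sum(w * (f["score"] if f["filled"] else 0.0) for w, f in zip(weights, fields))
    overall = earned / total if total else 0.0
    return {"completeness": completeness, "correctness": correctness, "overall": overall}
```

With one correct key field, one half-right ordinary field, and one missing ordinary field, this yields completeness 2/3, correctness 0.75, and overall 0.625, which shows how a missing field lowers completeness without touching correctness.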

The expandable row shows the per-field breakdown: candidate vs reference, which rung of the ladder decided it, and the similarity where relevant.

When a scenario runs a model more than once (repetitions), each run is scored on its own and the row shows the mean quality plus a consistency spread (the lowest and highest quality among the runs), so a model that is right on average but erratic is easy to spot. The output displayed for the row is taken from the run whose quality is the median.
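The aggregation over repetitions can be sketched as below, assuming each run has already been reduced to a single quality number. The function name `summarize_runs` is hypothetical, and picking the lower of the two middle runs when the count is even is an assumption made for this sketch.

```python
from statistics import mean


def summarize_runs(qualities):
    """Summarize per-run quality scores for one model.

    Returns the mean, the (lowest, highest) spread, and the index of the
    median-by-quality run, whose output would be the one displayed.
    """
    if not qualities:
        raise ValueError("at least one run is required")
    # Indices of the runs ordered from lowest to highest quality.
    ordered = sorted(range(len(qualities)), key=lambda i: qualities[i])
    # Middle run; for an even count this takes the lower middle (assumption).
    median_run = ordered[(len(ordered) - 1) // 2]
    return {
        "mean": mean(qualities),
        "spread": (min(qualities), max(qualities)),
        "median_run": median_run,
    }
```

For runs scoring 0.9, 0.5, and 0.7, the mean is about 0.7, the spread is 0.5 to 0.9, and the third run (index 2) is the one shown.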

Cost & what runs

Scoring is a separate pass over already-saved results. It never re-enriches, so it never re-pays for the models under test. It does embed text to compare values (and runs the judge, if the scenario has one), which deducts credits based on usage. Scoring runs automatically at the end of every benchmark run, and again whenever you re-score. If your organization has no embedding model configured (and the scenario sets no override), scoring still runs but falls back to exact matching only, so alternate spellings count as mismatches, and the result says that the fallback was used.

Where to find it

In Model Management → Benchmarks, set and verify a reference in the scenario editor (and pick its judge model, embedding model, and strictness there). From then on, every run auto-scores its successful results — a sortable Quality column fills in with no extra step. Use Re-score results (the header button or the ··· menu) to re-grade after you edit the reference or the scoring config.

Model Benchmarks

Saved scenarios, runs, and side-by-side output & cost.

Semantic IDs

The embedding resolution that also powers equivalence matching.