How we measure whether an agent's work is actually good

We publish one quality metric: 91% of agent deliverables are accepted without major revision. That number does a lot of work in our sales conversations, which means it deserves scrutiny — what counts, what doesn't, and how it is measured. This post is the long answer. It also explains why "passes tests" is the wrong bar for agent output, especially for a code review agent, where the tests measure the code and not the review.

The short version: quality measurement at FuturOne is three systems stacked on each other. Rubrics score artifacts. Calibration makes the agent's confidence mean something. Escalation thresholds decide what ships at all. The 91% only makes sense once you understand all three.

Rubric-based artifact scoring

Every artifact type has a rubric, versioned in our repo next to the workflow definitions, because an unversioned rubric drifts into whatever the loudest reviewer wanted last month. A code review is scored on finding correctness (is each flagged issue real?), severity calibration (does "critical" mean critical?), coverage (did it miss what a strong human reviewer would have caught?), convention adherence, and actionability of the suggested fixes. A research memo is scored on claim-source linkage, source quality, coverage of disconfirming evidence, and recency. Each dimension is scored 0–4 against anchored descriptions, not gut feel:

dimension: finding_correctness
weight: 0.35
scale:
  4: finding is real; root cause correctly identified
  3: finding is real; root cause partially identified
  2: finding is real; severity off by one level
  1: finding is technically real but immaterial in context
  0: finding is not a real issue

Who grades matters as much as the rubric. Graders are separate model passes with fresh context: they see the artifact, the inputs, and the rubric — never the producer's reasoning. When we let the producing model grade its own work in the same context early in the beta, scores ran roughly 0.6 points hot on a 4-point scale; the grader inherits the producer's blind spots along with its context. We also route grading to a different model than the one that produced the step where practical, which the model-agnostic routing layer makes cheap.

Model graders are themselves audited. A sample of runs is human-graded every week, and we track model–human agreement per domain; when agreement drops below our floor, that rubric gets revised before we trust another week of its scores. Humans do not agree perfectly with each other either — inter-reviewer agreement tops out well short of 1.0 — which is worth remembering: there is a ceiling on how precise any of this can be, and pretending otherwise is its own failure mode.

Confidence calibration

Every deliverable ships with confidence scores — per section for research memos, per finding for code reviews. A confidence score is only information if it is calibrated: when the agent says 90%, the acceptance rate of those items should be about 90%. So we treat calibration as a measured property, not an aspiration. Stated confidence is binned, plotted against actual acceptance outcomes, and tracked per domain alongside a Brier score.

It drifts. Last fall, the research agent ran about twelve points overconfident on claims supported by sparse sources — it confused "I found nothing contradicting this" with "this is well supported." The fix was twofold: the source verification pass that shipped this week now distinguishes corroborated claims from merely uncontradicted ones, and we re-fit the calibration map (isotonic regression over rolling 90-day outcomes) per domain. Calibration is also re-fit any time model routing changes, because a confidence map tuned for one model's habits does not survive a model swap.

Escalation thresholds: permission to say "I'm not sure"

Below a confidence threshold, the agent does not deliver with a shrug — it escalates. An escalation names the specific sections or findings in doubt and what would resolve them: a source it could not verify, a requirement that reads two ways, a test it could not run. The reviewer starts from the doubt, not from scratch.

Setting the threshold is a product decision wearing a statistics costume. Too low, and junk ships with a confident face. Too high, and the agent escalates everything, saving nobody any time — an agent that always asks is a search engine with extra steps. We tune thresholds per domain to hold escalations in roughly the 7–10% band of production runs, and we revisit the band when calibration shifts.

This is the part most evaluations skip: the 91% acceptance number is only trustworthy because the agent is allowed to decline to deliver. Remove escalation and acceptance falls — you would be forcing the agent's most uncertain work into reviewers' hands stamped as done.

What the 91% actually counts

Precisely: 91% of delivered artifacts are accepted without major revision, measured on production runs over a trailing quarter — not on a benchmark set we curated.

"Major revision" is defined, not vibes: the reviewer reverses a conclusion, removes findings as incorrect, or rewrites more than a quarter of the artifact's substance. Tone, formatting, and house-style edits do not count against the agent. The line is documented for customers so dispositions are comparable across teams.
Dispositions come from real workflows. The dashboard records accept/revise/reject actions on each deliverable. For code review runs we also ingest GitHub signals: auto-fix PRs merged versus closed, inline comments resolved versus dismissed. Where the two disagree, the human's explicit disposition wins.
Escalated runs are excluded from the denominator and reported separately. An escalation is the agent declining to deliver; counting it as either accepted or rejected would corrupt the metric in opposite directions. Customers see both numbers.

One more honesty requirement: acceptance alone is gameable. An agent optimized purely for acceptance learns to make fewer, safer claims — short reviews full of unfalsifiable observations get accepted at wonderful rates while catching nothing. So acceptance is always read against the coverage dimensions of the rubric. Acceptance up while coverage drifts down is not improvement; it is the agent learning to flatter its graders, and we treat it as a regression.

Why "passes tests" is not enough

For coding agents the industry default is to lean on CI: if the tests pass, the work was good. For a code review agent this is a category error — green CI tells you about the code under review, nothing about the review itself. The failure modes that actually destroy a review agent's value are invisible to tests:

False positives. Every wrong finding spends reviewer trust. A reviewer who dismisses three bogus comments stops reading the fourth — which is the one that mattered. Comment dismissal rate is our canary metric, watched per repository, because trust erodes locally before it shows up globally.
Missed criticals. No test fails when a review overlooks an authorization bypass. We measure recall the only way you can: seeded-defect benchmarks, replaying changesets with known vulnerabilities and regressions through the agent and counting what it catches. The seeded set grows every time production surfaces a miss — each one becomes a permanent regression test for the reviewer, exactly as a bug becomes a test for the code.
Severity inflation. An agent that calls everything critical is as useless as one that misses everything, just more annoying. Severity calibration is scored in the rubric and audited against the human grading sample.

The same logic generalizes. "The memo compiles" was never the bar for human analysts; "the code runs" should not be the bar for agents whose deliverable is judgment.

What we still get wrong

Three open problems, so this does not read like a victory lap. Our graders favored longer reviews for months before we noticed and added length-controlled scoring — verbosity bias is the most persistent grader pathology we have found. Rubric versioning makes longitudinal comparison messy: when the rubric improves, history scored under the old one is not strictly comparable, and we annotate rather than rescore. And the human-agreement ceiling means small quarter-over-quarter movements in any of these numbers are noise; we try hard not to celebrate them.

Every run in the dashboard shows its rubric scores and confidence next to the deliverable — the same data described here, per run, including in the demo. We think any vendor publishing an acceptance number owes you this level of detail about what it means. Ask for it.