
Appendix: Designing a Custom Benchmark#

Why SWE-bench and GAIA are not enough#

SWE-bench Verified and GAIA (Appendix B) tell you roughly where you sit relative to every other team publishing scores. They do not tell you whether your agent is good at your job.

SWE-bench is scraped from public Python repos. If your agent works on TypeScript services, internal monorepos with 40 custom lint rules, or a data platform where the fix is a correct SQL rewrite, SWE-bench measures something adjacent to your work but not your work. GAIA is further removed; multi-hop web research is a skill most production agents are never paid to exercise. Treating these as your primary signal means optimizing for the wrong distribution: your public score rises while your users keep reporting the same bugs.

The fix is a benchmark of your own. Small, specific to your domain, versioned, owned by the team shipping the agent.

Methodology#

Problem-set creation#

Seed from reality. In rough order of signal strength:

  • Real support tickets or customer requests: the distribution of things users actually ask for.
  • Real PRs, issues, or queries from your repos and analytics tools, especially the ones your team closed as fixed with a clear ground-truth answer.
  • Historical failure cases: the last 20 incidents or regressions. Each is a benchmark case with documented correct behavior.

Avoid synthesizing cases from an LLM. Synthesized cases measure how well your agent handles the distribution an LLM imagines, which rarely matches production.

Bucket by difficulty: roughly 40% easy, 40% medium, 20% hard. Proportions matter; a benchmark with 95% easy cases lets noise dominate the headline metric. Target 50-200 cases. Below 50, confidence intervals are too wide to detect real changes. Above 200, runs cost enough that you stop doing them.
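The case-count floor follows from binomial noise. A back-of-envelope check using the normal approximation (a sketch, not a substitute for a proper power analysis):

```python
import math

def ci_halfwidth(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a measured pass rate over n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# Half-width around a true pass rate of 0.8 at different benchmark sizes.
for n in (20, 50, 200):
    print(f"{n:>3} cases: +/-{ci_halfwidth(0.8, n):.1%}")
```

At 20 cases a real 3-point improvement vanishes inside a roughly ±18-point interval; at 50 the interval is about ±11 points, and at 200 about ±5.5. Single-point deltas are below the noise floor at any size in this range, which is why trend lines across runs matter more than any one score.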

Oracle / ground truth#

Three options, in order of cost and reliability:

  • Programmatic check: the expected output is data you can compare bit-for-bit (a cleaned CSV, a passing test suite, a SQL query returning a known row set). Best when available. Zero per-run cost, no judge bias.
  • LLM-as-judge: a second model scores the agent's output against a rubric. Cheaper than human annotation but introduces judge drift; pin the judge model and rubric version (Chapter 05 covers this).
  • Human annotation: the gold standard, priced accordingly. Usually reserved for the final 10-20 hardest cases where neither of the above works.

Most real benchmarks mix all three. A good heuristic: start programmatic wherever possible, fall back to LLM-as-judge for qualitative criteria, and reserve human review for ambiguous tail cases.
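One way to wire the mix together is a per-case oracle tag dispatched at grading time. The registry below is a sketch with made-up names, not part of the Chapter 05 harness:

```python
from typing import Callable

# Hypothetical registry of programmatic checks, keyed by the case's oracle tag.
ORACLES: dict[str, Callable[[str, str], bool]] = {
    "exact": lambda actual, expected: actual.strip() == expected.strip(),
}

def grade(oracle_tag: str, actual: str, expected: str) -> bool:
    if oracle_tag in ORACLES:       # programmatic: free, deterministic
        return ORACLES[oracle_tag](actual, expected)
    if oracle_tag == "judge":       # LLM-as-judge: call the pinned judge model
        raise NotImplementedError("pinned judge model + rubric version go here")
    return False                    # unknown tag: fail closed, queue for human review
```

Failing closed on unknown tags keeps a mistagged case from silently inflating the pass rate.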

Metric choice#

Accuracy alone is not enough. Pair the pass rate with:

  • Cost per case: the USD figure from your Chapter 05 harness. Quality numbers detached from cost are advertising.
  • p50 and p99 latency: averages hide the failure modes that matter. A 90% pass rate with a 30-second p99 is a user-hostile agent.
  • Failure-mode breakdown: categorize the failed cases. "Pass rate 85%" without knowing whether the 15% are timeouts, hallucinations, or correct answers in the wrong format is not actionable.

Report all four together. A single scalar can be gamed; a four-tuple is much harder to move in the wrong direction without the team noticing.

Worked example: data quality fixer#

A concrete custom benchmark for an agent that fixes quality issues in CSV files.

The problem set#

Ten cases covering the issues that appear in real intake pipelines:

  1. Null values in required columns (replace or drop)
  2. Duplicate rows on a natural key
  3. Numeric columns stored as strings ("1,234" and "$45.67")
  4. Mixed encodings (UTF-8 with stray latin-1 bytes)
  5. Schema drift: a new optional column that the downstream system cannot handle
  6. Inconsistent date formats ("2026-04-22", "04/22/2026", "Apr 22 2026")
  7. Trailing whitespace in string columns
  8. Outliers that are obvious typos (age = 201, price = -50)
  9. Column name drift (snake_case vs camelCase vs Sentence Case)
  10. Header row missing entirely

Each case ships with an input CSV and an expected-clean CSV. That pairing is the whole oracle.
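As a sketch of what one such pairing looks like, case 3 (numeric columns stored as strings) can be built in a few lines. Column names here are illustrative, not from the benchmark:

```python
import pandas as pd

# Input CSV: numeric columns arrive as formatted strings.
raw = pd.DataFrame({"qty": ["1,234", "56"], "price": ["$45.67", "$8.00"]})

# Expected-clean CSV: proper numeric dtypes, formatting stripped.
clean = raw.copy()
clean["qty"] = clean["qty"].str.replace(",", "", regex=False).astype(int)
clean["price"] = clean["price"].str.replace("$", "", regex=False).astype(float)
```

Writing `raw` to `input.csv` and `clean` to `expected.csv` yields one complete case; the other nine follow the same pattern.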

Oracle: programmatic#

The check is a pandas DataFrame.equals() between the agent's output and the expected file. Zero ambiguity. Zero judge cost. Fast enough to run 10 cases in under a second of oracle time.

import pandas as pd

def oracle(agent_output_path: str, expected_path: str) -> bool:
    # DataFrame.equals is strict: values, dtypes, column order, and row order
    # must all match -- exactly the bit-for-bit guarantee we want here.
    actual = pd.read_csv(agent_output_path)
    expected = pd.read_csv(expected_path)
    return actual.equals(expected)

Integration with EvalHarness#

The harness from Chapter 05 is small enough that a custom problem set and oracle plug in directly.

from pathlib import Path
from swarm.eval.harness import EvalCase, EvalHarness

def load_cases(base: Path) -> list[EvalCase]:
    return [
        EvalCase(
            id=d.name,
            input=(d / "input.csv").read_text(),
            expected_output=(d / "expected.csv").read_text(),
            tags=["data_quality", d.name],
        )
        for d in sorted(base.iterdir()) if d.is_dir()
    ]

cases = load_cases(Path("benchmarks/dq/v1"))
harness = EvalHarness(cases)  # no LLM judge; string-match oracle suffices

run = await harness.run(
    model="claude-sonnet-4-6",
    system="You are a data quality agent. Clean the CSV. Return only the cleaned CSV.",
    bus=bus,
)

print(f"pass_rate: {run.passed / run.cases:.1%}")
print(f"avg_cost:  ${run.total_cost_usd / run.cases:.3f}")
# With only 10 cases, the max latency stands in for p99.
print(f"p99_latency_ms: {max(r.latency_ms for r in run.results)}")

Interpreting the first run#

A plausible first result:

pass_rate: 80% (8 / 10)
avg_cost:  $0.031 / case
p99_latency_ms: 6200

Read it as a triple, not a single number. 80% is the headline. $0.03 per case means running this on every PR is cheap. A 6.2-second p99 is tolerable in batch, uncomfortable user-facing. The two failures are the work: look at the case IDs, read the outputs, iterate.

Versioning your benchmark#

Benchmarks drift. Add cases when new failure modes appear in production. Retire cases when every variant passes them; they no longer discriminate.

Pin a version: benchmarks/dq/v1/, benchmarks/dq/v2/. When the case set changes, the version bumps. Do not directly compare v1 to v2 scores. Keep the old version runnable for a few releases so you can do an apples-to-apples comparison before switching.

The habit that compounds: every time a production bug slips past the agent, add a case for it in the next version. Over a year the benchmark becomes a living record of the failure modes your agent has learned to prevent.