Chapter 05 — Evaluation and Pareto Frontiers¶
Companion to book/ch05_*.md. Runs top-to-bottom in Google Colab in mock mode with no API key required.
import os

if not os.path.exists("crafting-agentic-swarms"):
    !git clone https://github.com/TheAiSingularity/crafting-agentic-swarms.git
%cd crafting-agentic-swarms
!pip install -e ".[dev]" --quiet
!pip install matplotlib plotly ipywidgets --quiet
import os
try:
    from google.colab import userdata
    os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
    print("Using real API (key from Colab secrets).")
except Exception:
    # Not on Colab, or no secret configured: fall back to mock fixtures.
    os.environ.setdefault("SWARM_MOCK", "true")
    print("Running in mock mode (no API key needed).")
What you'll build here¶
- Build a 5-case synthetic eval set and run it through swarm.eval.harness.EvalHarness.
- Score three simulated models with swarm.eval.judge.LLMJudge.
- Plot a 2D Pareto frontier of cost versus accuracy.
- Measure judge position bias with pairwise comparisons run in both orders.
1. Why evaluation matters¶
Every time a model version changes, an agent's behaviour can shift in subtle ways. A small eval set pinned to a known-good revision is the cheapest way to catch regressions before they hit users. The EvalHarness class gives us a reproducible shell: load cases, run them against a model, save the results.
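The shape of that shell can be sketched in a few lines. This is an illustrative stand-in, not the real swarm.eval.harness API — MiniCase, MiniResult, and run_cases are invented names:

```python
from dataclasses import dataclass

# Hypothetical minimal stand-in for the load/run/score loop; the real
# EvalHarness has a richer interface (async runs, cost accounting,
# JSONL persistence).
@dataclass
class MiniCase:
    id: str
    input: str
    expected_output: str = ""

@dataclass
class MiniResult:
    case_id: str
    output: str
    passed: bool

def run_cases(cases, model_fn):
    """Run every case through model_fn; pass/fail by substring match."""
    results = []
    for case in cases:
        output = model_fn(case.input)
        results.append(MiniResult(case.id, output, case.expected_output in output))
    return results

results = run_cases(
    [MiniCase("c1", "say hi", expected_output="hi")],
    model_fn=lambda prompt: "hi there",  # canned model for demonstration
)
print(results[0].passed)  # True: "hi" appears in the canned output
```

The point is the separation: cases are data, the model is a swappable callable, and results carry enough to diff later.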
2. Build a synthetic eval set¶
Each EvalCase carries an id, an input prompt, an optional reference answer, and tags. We use tags to slice results later (easy vs hard, code vs prose).
from swarm.eval.harness import EvalCase

cases = [
    EvalCase(id="c1", input="Summarise the Pareto frontier concept in one sentence.",
             expected_output="dominant trade-off", tags=["prose", "easy"]),
    EvalCase(id="c2", input="Write Python that reverses a list in place.",
             expected_output="reverse", tags=["code", "easy"]),
    EvalCase(id="c3", input="Explain the tradeoff between Haiku and Sonnet for routing.",
             expected_output="cost", tags=["prose", "medium"]),
    EvalCase(id="c4", input="Analyze why LLM-as-judge shows position bias.",
             expected_output="order", tags=["prose", "hard"]),
    EvalCase(id="c5", input="Implement fork-join with asyncio.gather.",
             expected_output="gather", tags=["code", "medium"]),
]
print(f"{len(cases)} eval cases ready")
3. Run the harness against three simulated models¶
In mock mode, the harness does not hit any API. It replays fixture responses, so the full code path (scoring, cost accounting, latency tracking) is exercised end-to-end. Top-level await is supported natively in Colab, so no event-loop boilerplate is needed.
from swarm.eval.harness import EvalHarness
harness = EvalHarness(cases)
model_ids = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-6",
]

runs = []
for m in model_ids:
    run = await harness.run(model=m, system="You are an evaluator.")
    runs.append(run)
    print(f"{m}: {run.passed}/{run.cases} passed, avg_score={run.avg_score}")
4. Inspect a single result¶
EvalResult carries enough detail to diagnose a failing case: the full output, the score, latency, and cost. In production you persist this as JSON-lines and diff across revisions.
first = runs[0].results[0]
print(f"case_id={first.case_id}")
print(f"passed={first.passed} score={first.score} cost=${first.cost_usd:.4f} latency={first.latency_ms}ms")
print(f"output: {first.output[:200]}")
5. Synthesise per-model quality differences¶
Mock mode returns identical outputs across models. To make the Pareto plot meaningful we simulate plausible quality deltas: bigger models get higher scores and higher cost. The quality numbers are illustrative; the cost figures use published April-2026 rates per million input tokens.
import random
SIM_QUALITY = {
    "claude-haiku-4-5-20251001": 0.72,
    "claude-sonnet-4-6": 0.86,
    "claude-opus-4-6": 0.93,
}

# USD per million input tokens
SIM_COST_PER_M_IN = {
    "claude-haiku-4-5-20251001": 0.80,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-6": 15.00,
}
random.seed(42)
summary = []
for run in runs:
    base = SIM_QUALITY[run.model]
    accuracy = sum(max(0.0, min(1.0, base + random.uniform(-0.1, 0.1))) for _ in run.results) / len(run.results)
    cost = SIM_COST_PER_M_IN[run.model] * 0.001 * len(run.results)  # assumes ~1k input tokens per case
    summary.append({"model": run.model, "accuracy": round(accuracy, 3), "cost_usd": round(cost, 4)})

for row in summary:
    print(row)
6. Compute the Pareto frontier¶
Sort the points by cost and walk them, keeping only those that strictly improve accuracy. Those points form the efficient frontier; everything else is dominated by a cheaper, at-least-as-accurate option.
sorted_rows = sorted(summary, key=lambda r: r["cost_usd"])
frontier = []
best_acc = -1.0
for row in sorted_rows:
    if row["accuracy"] > best_acc:
        frontier.append(row)
        best_acc = row["accuracy"]

print("Frontier models (in cost order):")
for row in frontier:
    print(f"  {row['model']:32s} accuracy={row['accuracy']:.3f} cost=${row['cost_usd']:.4f}")
7. Plot the frontier with matplotlib¶
A 2D scatter puts each model at (cost, accuracy). The dashed line connects the efficient points; everything below is a dominated choice for your current task.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 5))
for row in summary:
    ax.scatter(row["cost_usd"], row["accuracy"], s=120)
    label = row["model"].split("-")[1]
    ax.annotate(label, (row["cost_usd"], row["accuracy"]),
                xytext=(8, 4), textcoords="offset points")
fx = [r["cost_usd"] for r in frontier]
fy = [r["accuracy"] for r in frontier]
ax.plot(fx, fy, "--", alpha=0.6, label="Pareto frontier")
ax.set_xlabel("Cost (USD) — lower is better")
ax.set_ylabel("Accuracy — higher is better")
ax.set_title("Cost / accuracy Pareto frontier (simulated)")
ax.legend()
ax.grid(alpha=0.3)
plt.show()
8. Judge position bias¶
LLM judges often prefer whichever candidate they see first — a well-documented artefact called position bias. We run each comparison twice, swap the order, and count how often the verdict flips. The pairwise method in LLMJudge does this for us automatically.
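Before reaching for LLMJudge, the both-orders trick is easy to demonstrate with a toy judge. judge_fn here is a hypothetical callable: given (first, second), it returns "A" if it prefers the first-presented response, else "B":

```python
# Toy demonstration of the run-twice-and-swap technique for detecting
# position bias; names here are illustrative, not the LLMJudge API.
def pairwise_with_swap(a, b, judge_fn):
    first = judge_fn(a, b)    # a shown first
    second = judge_fn(b, a)   # b shown first
    winner_1 = a if first == "A" else b
    winner_2 = b if second == "A" else a
    # position_swapped: the verdict changed when the order changed
    return winner_1, winner_1 != winner_2

# A maximally biased toy judge always prefers whichever comes first:
always_first = lambda x, y: "A"
winner, flipped = pairwise_with_swap("short answer", "long answer", always_first)
print(flipped)  # True: a fully position-biased judge flips on every pair
```

A judge with a genuine preference gives the same winner in both orders, so flipped stays False; the flip rate therefore isolates order sensitivity from real quality differences.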
from swarm.eval.judge import LLMJudge
judge = LLMJudge(model="claude-sonnet-4-6")
pairs = [
    ("Response about Haiku cost", "Response about Sonnet cost"),
    ("Response explaining parallelism", "Response explaining serial execution"),
    ("Short answer", "Long answer with examples"),
    ("Code with no tests", "Code with tests"),
    ("Vague plan", "Structured plan with milestones"),
]
verdicts = []
for a, b in pairs:
    verdict = await judge.pairwise(a, b, prompt="Which response is better?")
    verdicts.append(verdict)
    print(f"winner={verdict.winner} swapped={verdict.position_swapped}")
9. Visualise flip rate¶
In mock mode every fixture is identical, so position_swapped stays False. With a real judge you expect a non-trivial flip rate around 10-30 percent on ambiguous pairs. The fix is always to score in both orders and average the scores, which is what pairwise does.
flipped = sum(1 for v in verdicts if v.position_swapped)
total = len(verdicts)
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["stable", "flipped"], [total - flipped, flipped], color=["#3d7eff", "#ff6b6b"])
ax.set_title(f"Pairwise judge flip rate: {flipped}/{total}")
ax.set_ylabel("pairs")
plt.show()
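The score-in-both-orders-and-average mitigation can also be sketched independently of any judge library. score_fn below is a hypothetical scorer: given (first, second), it returns a score for each candidate in presentation order:

```python
def order_averaged(a, b, score_fn):
    """Score both candidates in both presentation orders and average.
    Any constant bonus the judge gives the first slot lands on each
    candidate exactly once, so it cancels out of the ranking."""
    a1, b1 = score_fn(a, b)   # a presented first
    b2, a2 = score_fn(b, a)   # b presented first
    return (a1 + a2) / 2, (b1 + b2) / 2

# Toy judge with a flat +0.2 bonus for whichever answer appears first:
def biased_scores(first, second):
    true_quality = {"good": 0.8, "bad": 0.4}
    return true_quality[first] + 0.2, true_quality[second]

a_avg, b_avg = order_averaged("good", "bad", biased_scores)
print(a_avg, b_avg)  # ≈ 0.9 and 0.5: the bonus splits evenly, ranking preserved
```

This is the same idea the text attributes to pairwise; doubling the judge calls buys order-invariant scores.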
10. Score deltas by tag¶
Aggregate scores across tags to see which kinds of cases the swarm handles well. In production this is how you spot category-specific regressions during model upgrades. If code drops 15 percent but prose holds, you know where to look.
import collections
tag_scores = collections.defaultdict(list)
for run in runs:
    for case, result in zip(cases, run.results):
        for tag in case.tags:
            tag_scores[tag].append(result.score)

rows = [(tag, round(sum(v) / len(v), 3), len(v)) for tag, v in tag_scores.items()]
for tag, avg, n in sorted(rows, key=lambda r: -r[1]):
    print(f"{tag:10s} avg={avg} n={n}")
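The category-specific regression check described above reduces to comparing per-tag averages between a baseline run and a candidate run, both shaped like tag_scores. The function name and the 15-percent threshold here are illustrative:

```python
def tag_regressions(baseline, candidate, threshold=0.15):
    """Flag tags whose average score dropped by more than `threshold`
    (relative) between two {tag: [scores]} mappings."""
    flagged = []
    for tag, base_scores in baseline.items():
        if tag not in candidate:
            continue
        base = sum(base_scores) / len(base_scores)
        cand = sum(candidate[tag]) / len(candidate[tag])
        if base > 0 and (base - cand) / base > threshold:
            flagged.append((tag, round(base, 3), round(cand, 3)))
    return flagged

baseline = {"code": [0.9, 0.8], "prose": [0.7, 0.75]}
candidate = {"code": [0.6, 0.5], "prose": [0.7, 0.74]}
print(tag_regressions(baseline, candidate))  # → [('code', 0.85, 0.55)]
```

Here code dropped roughly 35 percent while prose held, so only code is flagged — exactly the "you know where to look" signal.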
11. Per-tag bar chart¶
Same data, visual: a stakeholder wants to see at a glance where the weak points are.
ordered = sorted(rows, key=lambda r: -r[1])
tag_names = [t for t, _, _ in ordered]
tag_avgs = [a for _, a, _ in ordered]
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(tag_names, tag_avgs, color="#3d7eff")
ax.set_ylim(0, 1.05)
ax.set_ylabel("Average score")
ax.set_title("Per-tag average score")
plt.show()
12. Export run log¶
Every EvalRun can be saved as JSON-lines for later diffing. Keep these in source control so you can bisect a quality regression to a specific commit. One file per run.
import tempfile
from pathlib import Path
tmp = Path(tempfile.mkdtemp())
for run in runs:
    path = tmp / f"{run.model}.jsonl"
    harness.save(run, str(path))
    print(f"wrote {path.name} size={path.stat().st_size} bytes")
13. Reload and compare two runs¶
harness.compare diffs two runs side by side. It reports the score delta, cost delta, and latency delta. This is the primitive for CI: the new run should be no worse than the last green run.
reloaded_a = harness.load(str(tmp / f"{runs[0].model}.jsonl"))
reloaded_b = harness.load(str(tmp / f"{runs[1].model}.jsonl"))
diff = harness.compare(reloaded_a, reloaded_b)
for k, v in diff.items():
print(f" {k}: {v}")
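The CI gate itself — fail when the new run is worse than the last green one — reduces to a tolerance check on two average scores. A sketch (the function and the tolerance value are illustrative, not part of the harness API):

```python
def ci_gate(baseline_avg, candidate_avg, tolerance=0.02):
    """Pass when the candidate's average score is within `tolerance` of
    the baseline. A small tolerance absorbs sampling noise, which matters
    on an eval set as small as five cases. Returns (passed, delta)."""
    delta = candidate_avg - baseline_avg
    return delta >= -tolerance, round(delta, 4)

ok, delta = ci_gate(baseline_avg=0.86, candidate_avg=0.83)
print(ok, delta)  # False -0.03: a 3-point drop exceeds the 2-point tolerance
```

In a real pipeline the baseline would be loaded from the last checked-in green run and the gate would raise or exit non-zero instead of returning a bool.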
14. Interactive Pareto with plotly¶
Plotly gives hover tooltips for free. Colour-code the frontier so the dominated points stand out. Drop this straight into an internal dashboard.
import plotly.express as px
import pandas as pd
df = pd.DataFrame(summary)
df["on_frontier"] = df["model"].isin([r["model"] for r in frontier])
fig = px.scatter(df, x="cost_usd", y="accuracy", color="on_frontier",
                 size=[20] * len(df), hover_data=["model"],
                 title="Pareto frontier (interactive)")
fig.show()
15. What to try next¶
- Add a real Anthropic key via Colab secrets and re-run Cell 3. Watch cost_usd become non-zero.
- Swap the judge model between Haiku and Sonnet and see whether the flip rate changes.
- Add a 6th case to your eval set and re-run. How much does that shift the Pareto frontier?
- Persist runs as checked-in fixtures; wire CI to fail when avg_score drops below a threshold.
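The last suggestion can be sketched as a standalone check: read a saved JSON-lines run and fail when the average score falls below a floor. The "score" field name and the 0.75 floor are assumptions about the saved format, not the harness's documented schema:

```python
import json
import tempfile
from pathlib import Path

def check_score_floor(run_path: Path, floor: float = 0.75) -> bool:
    """Average the per-case scores in a JSONL run file; True if the
    average clears the floor. Wire the False branch to a CI failure."""
    scores = [json.loads(line)["score"]
              for line in run_path.read_text().splitlines() if line]
    return sum(scores) / len(scores) >= floor

# Demonstrate with a throwaway two-case fixture:
p = Path(tempfile.mkdtemp()) / "run.jsonl"
p.write_text('{"score": 0.9}\n{"score": 0.7}\n')
print(check_score_floor(p, floor=0.75))  # True: average 0.8 clears the floor
```

Committed alongside the fixtures, a check like this turns the eval set into a regression gate rather than a one-off report.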