Chapter 09 — Capstone Benchmarks¶
Companion to book/ch09_*.md. Runs top-to-bottom in Google Colab in mock mode with no API key required.
```python
import os

if not os.path.exists("crafting-agentic-swarms"):
    !git clone https://github.com/TheAiSingularity/crafting-agentic-swarms.git
%cd crafting-agentic-swarms
!pip install -e ".[dev]" --quiet
!pip install matplotlib plotly ipywidgets --quiet
```
```python
import os

try:
    from google.colab import userdata
    os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
    print("Using real API (key from Colab secrets).")
except Exception:
    # Not in Colab, or no secret configured: fall back to deterministic mock mode.
    os.environ.setdefault("SWARM_MOCK", "true")
    print("Running in mock mode (no API key needed).")
```
What you'll build here¶
- Run SWE-bench Lite and GAIA L1 subsets through the full swarm in mock mode.
- Tabulate per-case results with pandas.
- Plot a cost / accuracy Pareto scatter across benchmarks.
- Close the course with a 'what you built' review.
1. What this notebook does¶
You've shipped every module. This notebook takes the full production swarm and measures it against two classic benchmarks: SWE-bench Lite (code repair) and GAIA L1 (factual retrieval). Mock mode returns reproducible pass/fail mixes so the plots render meaningfully even without an API key.
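The "reproducible pass/fail mixes" work because mock mode derives outcomes from a seed rather than live model calls. A minimal sketch of the idea (the `mock_result` helper, the 80% pass probability, and the cost range are illustrative, not the library's actual implementation):

```python
import random

def mock_result(case_id: str, pass_prob: float = 0.8, seed: int = 42) -> dict:
    """Deterministic fake benchmark result: same case_id always yields the same outcome."""
    # Seed the RNG with the case id so every rerun reproduces the same mix.
    rng = random.Random(f"{seed}:{case_id}")
    return {
        "case_id": case_id,
        "passed": rng.random() < pass_prob,
        "cost_usd": round(rng.uniform(0.01, 0.05), 4),
    }

# Two calls for the same case agree exactly, so plots are stable across runs.
a = mock_result("swe-001")
b = mock_result("swe-001")
```

Seeding per case (rather than once globally) means adding or reordering cases does not perturb the results of existing ones.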
2. Import the capstone entry points¶
modules/12_capstone/code/capstone.py wires the full swarm (M01-M11) against two benchmarks. We import the runner functions plus the published case lists.
```python
import sys
from pathlib import Path

sys.path.insert(0, str(Path("modules/12_capstone/code")))

from capstone import (
    run_swe_bench, run_gaia_l1,
    SWE_BENCH_CASES, GAIA_L1_CASES,
    pareto_analysis,
)

print(f"SWE-bench cases: {len(SWE_BENCH_CASES)}")
print(f"GAIA L1 cases: {len(GAIA_L1_CASES)}")
```
3. Inspect the SWE-bench cases¶
Each case has an id, an input prompt, an expected output, and tags. The tags drive later slicing (easy/hard, bugfix/feature, etc.).
```python
for c in SWE_BENCH_CASES:
    print(f"[{c.id}] tags={c.tags}")
    print(f"  {c.input[:90]}")
```
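The tag-driven slicing mentioned above can be sketched in isolation. This is a self-contained stand-in: the `Case` dataclass and the tag values are hypothetical stand-ins for the real `SWE_BENCH_CASES` entries, which carry the same fields:

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    id: str
    input: str
    expected: str
    tags: list = field(default_factory=list)

cases = [
    Case("swe-001", "Fix off-by-one in pagination", "patch", ["easy", "bugfix"]),
    Case("swe-002", "Add CSV export endpoint", "patch", ["hard", "feature"]),
    Case("swe-003", "Repair flaky retry logic", "patch", ["hard", "bugfix"]),
]

def by_tag(cases: list, tag: str) -> list:
    """Slice a case list down to the cases carrying a given tag."""
    return [c for c in cases if tag in c.tags]

print([c.id for c in by_tag(cases, "bugfix")])  # ['swe-001', 'swe-003']
```

Slices like `by_tag(cases, "hard")` let you report pass rates per difficulty band instead of one blended number.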
4. Run SWE-bench Lite¶
Five representative tasks. Mock returns 4/5 pass, which is realistic for a production swarm on SWE-bench Lite (published SOTA with specialised scaffolds is ~74 percent).
```python
swe = await run_swe_bench(model="claude-haiku-4-5-20251001", max_cases=5)
print(f"Passed: {swe['passed']}/{swe['total']} rate={swe['pass_rate']:.1%}")
for r in swe["results"]:
    status = "PASS" if r["passed"] else "FAIL"
    print(f"  [{status}] {r['case_id']} ${r['cost_usd']:.4f}")
```
5. Run GAIA L1¶
Three factual retrieval questions. Mock returns 3/3 — in live mode with tool use enabled the swarm still hits 3/3 on these easy cases. GAIA L2 and L3 are where things get hard.
```python
gaia = await run_gaia_l1(model="claude-haiku-4-5-20251001")
print(f"Passed: {gaia['passed']}/{gaia['total']} rate={gaia['pass_rate']:.1%}")
for r in gaia["results"]:
    status = "PASS" if r["passed"] else "FAIL"
    print(f"  [{status}] {r['case_id']} ${r['cost_usd']:.4f}")
```
6. Results table¶
Tidy both result sets into a pandas DataFrame. Keep case_id as the index so we can diff against a future run.
```python
import pandas as pd

rows = []
for r in swe["results"]:
    rows.append({"benchmark": "SWE-bench", **r})
for r in gaia["results"]:
    rows.append({"benchmark": "GAIA-L1", **r})

df = pd.DataFrame(rows).set_index("case_id")
print(df)
```
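With `case_id` as the index, diffing against a future run is a single join. A sketch with two hand-made result frames (the second run's outcomes are hypothetical data, but the column names mirror the table above):

```python
import pandas as pd

run_a = pd.DataFrame(
    {"case_id": ["swe-001", "swe-002"], "passed": [True, False]}
).set_index("case_id")
run_b = pd.DataFrame(
    {"case_id": ["swe-001", "swe-002"], "passed": [True, True]}
).set_index("case_id")

# Align the two runs on case_id and flag cases whose outcome changed.
diff = run_a.join(run_b, lsuffix="_a", rsuffix="_b")
diff["flipped"] = diff["passed_a"] != diff["passed_b"]
print(diff[diff["flipped"]])  # only swe-002 changed between runs
```

The index join means a case added or dropped in the second run shows up as a NaN row rather than silently shifting every comparison.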
7. Pass/fail chart by case¶
Visual view of the same table. Green is pass, red is fail. In production this is the view your tech lead screenshots into Slack on benchmark day.
```python
import matplotlib.pyplot as plt

ids = df.index.tolist()
colors = ["#2ca02c" if p else "#d62728" for p in df["passed"]]

fig, ax = plt.subplots(figsize=(9, 3.5))
ax.bar(ids, [1] * len(ids), color=colors)
ax.set_ylim(0, 1.2)
ax.tick_params(axis="x", rotation=30)
ax.yaxis.set_visible(False)
ax.set_title("Per-case pass/fail (green = pass)")
plt.tight_layout()
plt.show()
```
8. Pareto analysis: simulated multi-model¶
Mock results come from a single model. For the Pareto plot we inject plausible numbers for Sonnet and Opus so the frontier has something to draw.
```python
model_rows = [
    {"model": "claude-haiku-4-5", "pass_rate": swe["pass_rate"], "cost_per_run": 0.08},
    {"model": "claude-sonnet-4-6", "pass_rate": 0.62, "cost_per_run": 0.31},
    {"model": "claude-opus-4-6", "pass_rate": 0.74, "cost_per_run": 1.55},
    {"model": "gpt-4o", "pass_rate": 0.49, "cost_per_run": 0.42},
]
print(pareto_analysis(model_rows))
```
9. Pareto scatter plot¶
Same logic as Chapter 05 applied to benchmark results. Models below the frontier are dominated — another choice beats them in both cost and accuracy.
```python
import plotly.express as px

pdf = pd.DataFrame(model_rows)

# Walk models in order of increasing cost; each new accuracy high sits on the frontier.
sorted_rows = pdf.sort_values("cost_per_run").to_dict(orient="records")
frontier = []
best = -1.0
for r in sorted_rows:
    if r["pass_rate"] > best:
        frontier.append(r["model"])
        best = r["pass_rate"]

pdf["on_frontier"] = pdf["model"].isin(frontier)
fig = px.scatter(pdf, x="cost_per_run", y="pass_rate", color="on_frontier",
                 hover_data=["model"], size=[18] * len(pdf),
                 title="SWE-bench Lite Pareto frontier (simulated)")
fig.show()
```
10. Summary statistics¶
A stakeholder cares about three numbers: pass rate, cost per case, and total cost. Surface all three in a single dict.
```python
total_cases = swe["total"] + gaia["total"]
total_passed = swe["passed"] + gaia["passed"]
total_cost = swe["total_cost_usd"] + gaia["total_cost_usd"]

summary = {
    "pass_rate": f"{total_passed / total_cases:.1%}",
    "cost_per_case": f"${total_cost / total_cases:.4f}",
    "total_cost_usd": f"${total_cost:.4f}",
    "cases": total_cases,
}
print(summary)
```
11. Bar chart by benchmark¶
Split pass rate by benchmark so you can see which one your swarm struggles on. This view goes straight on the README.
```python
bench_rows = [
    {"benchmark": "SWE-bench", "pass_rate": swe["pass_rate"], "cases": swe["total"]},
    {"benchmark": "GAIA-L1", "pass_rate": gaia["pass_rate"], "cases": gaia["total"]},
]
bdf = pd.DataFrame(bench_rows)

fig = px.bar(bdf, x="benchmark", y="pass_rate", color="benchmark",
             title="Pass rate by benchmark", text="pass_rate")
fig.update_traces(texttemplate="%{text:.0%}", textposition="outside")
fig.update_yaxes(range=[0, 1.05])
fig.show()
```
12. Cost breakdown¶
How much would the full benchmark suite cost on each model tier? Multiply cases by the per-case simulated cost to get a full-suite estimate.
```python
cost_by_model = pdf.copy()
cost_by_model["total_suite_cost"] = cost_by_model["cost_per_run"] * total_cases
print(cost_by_model[["model", "cost_per_run", "total_suite_cost"]])
```
13. What you built (M01-M12)¶
Across the course you assembled, from first principles:
- M01: raw HTTP call with token counting and cost math.
- M02: multi-provider client with retry and exponential backoff.
- M03: ReAct loop driving tools and tool results.
- M04: tool sandbox with MCP integration.
- M05: memory with episodic, semantic, procedural tiers and autoDream consolidation.
- M06: generator-critic refinement loop.
- M07:
EvalHarness,LLMJudge, position-bias correction. - M08: orchestrator-workers fork-join with git worktrees.
- M09: triage routing plus five compaction strategies.
- M10: 29-event HookBus, HITL, prompt-injection defence.
- M11: KAIROS daemon, crash recovery, skill library.
- M12: SWE-bench and GAIA benchmarks, Pareto analysis.
14. Frontiers¶
The course ended on M12 but the field keeps moving. Directions worth exploring:
- DSPy: compile declarative LLM pipelines with few-shot examples as training data.
- A2A protocol: agent-to-agent JSON-RPC. An npm for agents.
- Agentic RL (RLAIF): fine-tune the agent policy against a critic LLM.
- Emergent swarm behaviour: 10+ agents on a blackboard, no orchestrator.
- Multi-modal agents: screenshots in, code out.
- Agent marketplaces: compose rented agents like packages.
Pick one. Build. Measure. Repeat.
15. Your next benchmark¶
The best way to keep learning is to publish your own numbers. Pick a task you care about (your own bug tracker, a GitHub repo, a support queue), wire the swarm against it, and post the Pareto plot. That is the feedback loop the whole field runs on.
Course complete¶
You have shipped every module and measured the full swarm against real benchmarks. From here the loop is yours: build, measure, publish, repeat.