Appendix: Experiment Tracking and Statistical Significance#
When you improve your agent, is the improvement real?#
You change a prompt. You run the eval. The score went from 82% to 85%. Ship it?
Not yet. Three points on a 50-case eval is well inside the run-to-run noise floor of an LLM judge. The score could be up because the change helped, or because of judge drift, sampling randomness, or a handful of cases landing on the boundary of the rubric. Shipping "a 3-point improvement" when the real signal is "maybe +1, maybe -2" is the most common way experiment discipline goes wrong.
This appendix covers the minimum tooling: EvalComparison in swarm/eval/significance.py, the integration pattern for W&B / Comet / MLflow, and the per-experiment checklist.
EvalComparison in practice#
EvalComparison wraps two score lists and computes a Welch's t-test plus a 95% confidence interval. It is pure stdlib, no scipy, no new deps.
from swarm.eval.significance import EvalComparison, summarize

baseline = [r.score for r in baseline_run.results]  # 50 cases, old prompt
new = [r.score for r in new_run.results]            # 50 cases, new prompt

cmp = EvalComparison(baseline, new, metric_name="swe_bench_lite")
print(summarize(baseline, new, "swe_bench_lite"))
# swe_bench_lite: 82.3% -> 85.1% (delta=+2.8%, 95% CI [-0.1, 5.7], p=0.061, not significant)

if cmp.is_significant():
    ship()
else:
    n = cmp.required_sample_size(effect=0.03, power=0.8)
    print(f"Need {n} cases per arm to detect a 3-point effect with 80% power.")
If the 95% CI brackets zero, you do not have a result, you have a hypothesis. Run more cases, or accept that the change is below your eval's resolution and stop iterating.
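If you are curious what the stdlib-only math amounts to, here is a minimal sketch under a normal approximation; the function names are illustrative, and the real EvalComparison in swarm/eval/significance.py may compute the tail with the exact t-distribution instead.

import math
from statistics import NormalDist, mean, variance

def welch_compare(baseline: list[float], new: list[float]):
    # Returns (delta, two-sided p-value, 95% CI for the delta).
    n1, n2 = len(baseline), len(new)
    delta = mean(new) - mean(baseline)
    # Welch standard error: per-arm sample variances, no pooling assumption
    se = math.sqrt(variance(baseline) / n1 + variance(new) / n2)
    z = delta / se if se else 0.0
    # Normal approximation to the t tail -- close enough at ~50 cases per arm
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    ci = (delta - 1.96 * se, delta + 1.96 * se)
    return delta, p, ci

def required_sample_size(baseline: list[float], new: list[float],
                         effect: float, power: float = 0.8) -> int:
    # n per arm for a two-sided 5% test at the requested power
    z_alpha = NormalDist().inv_cdf(0.975)
    z_beta = NormalDist().inv_cdf(power)
    pooled_var = (variance(baseline) + variance(new)) / 2
    return math.ceil(2 * pooled_var * (z_alpha + z_beta) ** 2 / effect ** 2)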
W&B / Comet / MLflow integration#
Wrap tracking in a hook handler so the harness stays framework-neutral. Gating the import means the swarm still runs when the tracking library is absent.
async def make_tracking_hook(project: str, run_name: str):
    try:
        import wandb
        wandb.init(project=project, name=run_name)
    except Exception:
        wandb = None

    async def hook(payload: dict) -> None:
        if wandb is None:
            return
        wandb.log({
            "score": payload["avg_score"],
            "pass_rate": payload["passed"] / payload["cases"],
            "cost_usd": payload["total_cost_usd"],
            "p99_latency_ms": payload.get("p99_latency_ms"),
        })

    return hook

bus.on("eval_run_complete", await make_tracking_hook("agents", "prompt_v7"))
The same shape works for Comet (comet_ml.Experiment) and MLflow (mlflow.log_metrics). The pattern is: import gated behind try, call init once per run, log structured floats on eval_run_complete. Your CI can then flip W&B off with one env var when you do not want the network dependency.
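For MLflow, the hook looks roughly like this; the payload keys mirror the W&B hook above, and the wrapper name is illustrative rather than part of the harness.

async def make_mlflow_hook(experiment: str, run_name: str):
    try:
        import mlflow
        mlflow.set_experiment(experiment)
        mlflow.start_run(run_name=run_name)
    except Exception:
        mlflow = None

    async def hook(payload: dict) -> None:
        if mlflow is None:
            return
        mlflow.log_metrics({
            "score": payload["avg_score"],
            "pass_rate": payload["passed"] / payload["cases"],
            "cost_usd": payload["total_cost_usd"],
        })

    return hook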
What to log per experiment#
The minimum record that lets you reproduce a result six months later:
- git commit of the swarm source at run time (subprocess.check_output(["git", "rev-parse", "HEAD"]))
- model id including vendor prefix (claude-sonnet-4-6, not sonnet)
- prompt version: hash the system prompt string; log the hash and the full text as an artifact
- tier router config: the full routing table, not just the label
- total cost in USD for the run, plus per-case cost for the p99 analysis
- p99 latency and p50 latency per case
- avg score plus the 95% CI from EvalComparison. The scalar alone is misleading
- eval case set version: if you added or retired cases, the score is not directly comparable to last week's
- environment flags: SWARM_MOCK, SWARM_CACHE_ENABLED, anything that changes behavior
Pin these in one JSON blob per run, stored next to the eval JSONL. When a regression lands, git blame on the prompt hash finds the responsible change in under a minute.
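A sketch of what that blob can look like, assuming prompt_text and run_stats are whatever your harness already has in hand; the field names simply mirror the checklist above.

import hashlib
import json
import os
import subprocess

def write_run_manifest(path: str, prompt_text: str, run_stats: dict) -> None:
    manifest = {
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "model_id": "claude-sonnet-4-6",
        "prompt_sha256": hashlib.sha256(prompt_text.encode()).hexdigest(),
        "env": {k: os.environ.get(k) for k in ("SWARM_MOCK", "SWARM_CACHE_ENABLED")},
        # cost, p50/p99 latency, avg score + CI, router config, case-set version
        **run_stats,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)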