Appendix: Experiment Tracking and Statistical Significance#

When you improve your agent, is the improvement real?#

You change a prompt. You run the eval. The score went from 82% to 85%. Ship it?

Not yet. Three points on a 50-case eval is well inside the run-to-run noise floor of an LLM judge. The score could be up because the change helped, or because of judge drift, sampling randomness, or a handful of cases landing on the boundary of the rubric. Shipping "a 3-point improvement" when the real signal is "maybe +1, maybe -2" is the most common way experiment discipline goes wrong.
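A back-of-the-envelope check makes the noise floor concrete (my arithmetic, using only the 82% score and 50-case count from above):

```python
# Standard error of a pass rate estimated from n roughly binary cases:
# se = sqrt(p * (1 - p) / n). At p = 0.82 and n = 50 this is ~5.4 points,
# so a 3-point move is well under one standard error of a single run.
p, n = 0.82, 50
se = (p * (1 - p) / n) ** 0.5
print(f"one run's standard error: {se:.1%}")  # → one run's standard error: 5.4%
```

The observed 3-point delta is smaller than the uncertainty of either measurement on its own.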

This appendix covers the minimum tooling: EvalComparison in swarm/eval/significance.py, the integration pattern for W&B / Comet / MLflow, and the per-experiment checklist.

EvalComparison in practice#

EvalComparison wraps two score lists and computes a Welch's t-test plus a 95% confidence interval. It is pure stdlib, no scipy, no new deps.

from swarm.eval.significance import EvalComparison, summarize

baseline = [r.score for r in baseline_run.results]   # 50 cases, old prompt
new = [r.score for r in new_run.results]             # 50 cases, new prompt

cmp = EvalComparison(baseline, new, metric_name="swe_bench_lite")
print(summarize(baseline, new, "swe_bench_lite"))
# swe_bench_lite: 82.3% -> 85.1% (delta=+2.8%, 95% CI [-0.1, 5.7], p=0.061, not significant)

if cmp.is_significant():
    ship()
else:
    n = cmp.required_sample_size(effect=0.03, power=0.8)
    print(f"Need {n} cases per arm to detect a 3-point effect with 80% power.")

If the 95% CI brackets zero, you do not have a result; you have a hypothesis. Run more cases, or accept that the change is below your eval's resolution and stop iterating.
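For intuition about what a class like EvalComparison computes internally, here is a minimal pure-stdlib sketch of a Welch-style comparison. It is my illustration, not the actual swarm/eval/significance.py source, and it uses a normal approximation in place of the t distribution (a reasonable shortcut at ~50 cases per arm, where the stdlib has no t CDF):

```python
from statistics import NormalDist, mean, variance

def welch_ci(a: list[float], b: list[float], confidence: float = 0.95):
    """Delta of means b - a, its CI, and a two-sided p-value (normal approx.)."""
    na, nb = len(a), len(b)
    # Welch standard error: per-arm sample variances, no pooling assumption.
    se = (variance(a) / na + variance(b) / nb) ** 0.5
    delta = mean(b) - mean(a)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # ~1.96 at 95%
    p = 2 * (1 - NormalDist().cdf(abs(delta / se)))  # two-sided
    return delta, (delta - z * se, delta + z * se), p
```

The decision rule from the snippet above falls out directly: significant iff the CI excludes zero (equivalently, p below your alpha).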

W&B / Comet / MLflow integration#

Wrap tracking in a hook handler so the harness stays framework-neutral. A gated import means the swarm runs fine when the library is absent.

async def make_tracking_hook(project: str, run_name: str):
    try:
        import wandb
        wandb.init(project=project, name=run_name)
    except Exception:
        wandb = None

    async def hook(payload: dict) -> None:
        if wandb is None:
            return
        wandb.log({
            "score": payload["avg_score"],
            "pass_rate": payload["passed"] / payload["cases"],
            "cost_usd": payload["total_cost_usd"],
            "p99_latency_ms": payload.get("p99_latency_ms"),
        })
    return hook

bus.on("eval_run_complete", await make_tracking_hook("agents", "prompt_v7"))

Same shape works for Comet (comet_ml.Experiment) and MLflow (mlflow.log_metrics). The pattern: gate the import behind try/except, call init once per run, and log structured floats on eval_run_complete. Your CI can then flip W&B off with one env var when you do not want the network dependency.
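For concreteness, here is the same pattern transposed to MLflow. The hook name and payload keys mirror the W&B example above; mlflow.set_experiment, mlflow.start_run, and mlflow.log_metrics are real MLflow APIs, but the wiring is a sketch:

```python
async def make_mlflow_hook(experiment: str, run_name: str):
    try:
        import mlflow
        mlflow.set_experiment(experiment)
        mlflow.start_run(run_name=run_name)  # once per eval run
    except Exception:
        mlflow = None  # library absent or tracking server unreachable

    async def hook(payload: dict) -> None:
        if mlflow is None:
            return  # swarm keeps running without tracking
        mlflow.log_metrics({
            "score": payload["avg_score"],
            "pass_rate": payload["passed"] / payload["cases"],
            "cost_usd": payload["total_cost_usd"],
        })
    return hook
```

Registration is identical: bus.on("eval_run_complete", await make_mlflow_hook("agents", "prompt_v7")).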

What to log per experiment#

The minimum record that lets you reproduce a result six months later:

  • git commit of the swarm source at run time (subprocess.check_output(["git", "rev-parse", "HEAD"]))
  • model id including vendor prefix (claude-sonnet-4-6, not sonnet)
  • prompt version: hash the system prompt string; log the hash and the full text as an artifact
  • tier router config: the full routing table, not just the label
  • total cost in USD for the run, plus per-case cost for the p99 analysis
  • p99 latency and p50 latency per case
  • avg score plus the 95% CI from EvalComparison. The scalar alone is misleading
  • eval case set version: if you added or retired cases, the score is not directly comparable to last week's
  • environment flags: SWARM_MOCK, SWARM_CACHE_ENABLED, anything that changes behavior

Pin these in one JSON blob per run, stored next to the eval JSONL. When a regression lands, git blame on the prompt hash finds the responsible change in under a minute.
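A minimal sketch of assembling that blob, covering the commit, model id, and prompt hash from the checklist. Function and field names here are illustrative, not a fixed schema:

```python
import hashlib
import json
import subprocess
from pathlib import Path

def current_commit() -> str:
    # Commit of the swarm source at run time, as in the checklist above.
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def experiment_record(git_commit: str, model_id: str,
                      system_prompt: str, metrics: dict,
                      env_flags: dict) -> dict:
    return {
        "git_commit": git_commit,
        "model_id": model_id,          # full id with vendor prefix
        "prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "metrics": metrics,            # avg score, 95% CI, cost, p50/p99 latency
        "env_flags": env_flags,        # SWARM_MOCK, SWARM_CACHE_ENABLED, ...
    }

# record = experiment_record(current_commit(), "claude-sonnet-4-6",
#                            system_prompt, metrics, env_flags)
# Path("runs/prompt_v7.meta.json").write_text(json.dumps(record, indent=2))
```

Logging the prompt hash plus the full text as an artifact means the hash is the lookup key and the artifact is the diff target.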