Appendix: CI/CD for Agent Systems#

A test suite that takes five seconds proves your code imports. A CI pipeline that runs your eval harness on every PR proves your system still works. The gap between those two is where agent regressions hide. This appendix walks through .github/workflows/agent-eval.yml and the choices behind it.

Why eval on every PR#

Agent behavior changes for reasons diff tools do not catch. A one-line prompt tweak can drop pass rate by 15 points. A model-tier swap shifts cost by an order of magnitude. A new tool in the registry can silently break a worker's ReAct loop because the description is ambiguous.

None of these regressions fail a unit test. They fail the eval harness, and only the eval harness. Running EvalHarness on every PR is the earliest feedback loop you have before a bad change reaches staging.

The workflow#

```yaml
on:
  pull_request:
    paths:
      - 'swarm/**'
      - 'modules/*/code/**'
      - 'modules/*/solutions/**'
```

Scoping to those paths means documentation-only PRs do not burn CI minutes on tests whose outcome could not have changed. The path filter is the single most effective knob for keeping CI costs reasonable.

The pipeline has four stages.

Install. pip install -e ".[dev]" pulls in sklearn, pytest, and everything else. Pin the Python version so CI matches local.

Test suite. SWARM_MOCK=true pytest swarm/tests/ modules/ -q runs every unit test in mock mode. Zero API cost, deterministic, catches the cheap bugs.

Eval harness. python -m swarm.eval.harness --baseline origin/main --head HEAD --threshold 0.1 scores the PR's agents against the baseline on the committed eval set. A 0.1 score drop fails the build. Tighten for critical paths, loosen for exploratory modules.

Cost estimate. python scripts/estimate_cost_delta.py prints cost_delta_pct: +N.N%. A heuristic that catches the "someone swapped Haiku for Opus" class of mistake. Pair with --fail-on 20% once the team agrees on a ceiling.
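Assembled into a job, the four stages above might look like the following sketch. The commands are the ones quoted in this section; the job name, runner image, action versions, and the specific Python pin are illustrative assumptions, not part of the committed workflow.

```yaml
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full history so origin/main exists as a baseline
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'  # pin so CI matches local
      - run: pip install -e ".[dev]"
      - run: SWARM_MOCK=true pytest swarm/tests/ modules/ -q
      - run: python -m swarm.eval.harness --baseline origin/main --head HEAD --threshold 0.1
      - run: python scripts/estimate_cost_delta.py
```

Note the `fetch-depth: 0`: without full history, the shallow checkout GitHub Actions does by default has no `origin/main` to diff the head against.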

When to add cost and latency gates#

Do not start with them. A gate that blocks merges on a small cost increase turns into a gate everyone disables within a week. Earn the gate by running the eval in advisory mode for two to four weeks, watching the numbers in real PRs, then setting a threshold that separates signal from noise.

Rule of thumb: a threshold should block fewer than 5 percent of merged PRs. If it blocks more, either your code is more sensitive than assumed or the threshold is too tight.
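That rule of thumb is easy to check from advisory-mode data: log whether the gate would have fired on each merged PR, then compute the block rate. A minimal sketch, where `gate_block_rate` and the sample data are hypothetical:

```python
def gate_block_rate(would_block: list[bool]) -> float:
    """Fraction of merged PRs an advisory-mode gate would have blocked."""
    return sum(would_block) / len(would_block)

# Advisory-mode log from recent merged PRs (invented values):
# the gate would have fired on 2 of 40.
advisory = [False] * 38 + [True] * 2
rate = gate_block_rate(advisory)
print(f"block rate: {rate:.1%}")  # → block rate: 5.0% (right at the ceiling)
```

If that number creeps past 5 percent, revisit the threshold before enforcing it.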

For latency, measure median and p95 separately. A 50 percent p95 jump might be noise (one slow API call) or a real regression (a new synchronous hot-path call). Emit p95 before gating.
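Separating the two takes a few lines of standard library; `statistics.quantiles` yields the p95 cut point directly. A sketch with invented sample latencies, chosen so one slow call dominates p95 while barely moving the median:

```python
import statistics

def latency_summary(samples_ms: list[float]) -> tuple[float, float]:
    """Return (median, p95) latency in milliseconds."""
    median = statistics.median(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=100)[94]
    return median, p95

# Nineteen normal calls plus one slow API call (hypothetical data).
samples = [120.0] * 19 + [900.0]
median, p95 = latency_summary(samples)
print(f"median={median:.0f}ms p95={p95:.0f}ms")  # → median=120ms p95=861ms
```

The median is untouched by the outlier; only p95 registers it, which is exactly why the two must be emitted separately before either is gated.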

Integration with EvalHarness#

swarm.eval.EvalHarness exposes run(model, system), compare(run_a, run_b), and pareto_point(run). The CI workflow calls the first two directly:

```python
import sys

# Score the baseline and the candidate on the same committed eval set.
baseline = await harness.run(model, baseline_system)
candidate = await harness.run(model, candidate_system)
delta = harness.compare(baseline, candidate)
if delta["score_delta"] < -threshold:
    sys.exit(1)  # non-zero exit fails the CI job
```

The harness does the hard part: loading cases, running them against both systems, reporting deltas. The workflow wraps this in a shell command and handles the exit code.

What this does not catch#

It will not catch prompt-injection regressions: add adversarial cases to the eval set. It will not catch cost regressions from tool-use loops that burn tokens on retries: add a max_total_cost gate. It will not catch drift between mock fixtures and real model output: schedule a weekly real-API run.
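A max_total_cost gate of the kind suggested above can be as simple as summing per-case spend against a ceiling. A minimal sketch, where `check_cost_gate`, the per-case costs, and the $0.25 ceiling are all hypothetical:

```python
def check_cost_gate(case_costs_usd: list[float], max_total_cost: float) -> bool:
    """True if the eval run stayed under the cost ceiling.

    A runaway tool-use loop shows up here: each retry adds to that
    case's cost, so the total blows the budget even when the final
    score looks fine.
    """
    return sum(case_costs_usd) <= max_total_cost

# Third case retried repeatedly and cost 10x its neighbors (invented data).
costs = [0.02, 0.03, 0.41, 0.02]
ok = check_cost_gate(costs, 0.25)
print(f"total=${sum(costs):.2f} ceiling=$0.25 pass={ok}")  # → total=$0.48 ceiling=$0.25 pass=False
```

In CI this would map to a non-zero exit, same as the score gate.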

Ship the template, fix what it catches, add gates when you have evidence they help. Start simpler than you think you need to.