# Appendix: Debugging and Instrumentation Playbook
Agents fail in ways that no stack trace captures. A `rm -rf` fails loudly, with a line number. A worker that cheerfully loops for forty-five minutes burning twelve dollars on the same malformed tool call fails quietly, and the logs show ten thousand lines that all look fine individually. You need a different kind of runbook for these.
This appendix is organized around symptoms, not causes. You saw something in production — a billing alert, a regression on the eval harness, an output that violates a rule you thought you had locked down. You open the matching section. It walks you to the root cause without asking you to first guess which subsystem is broken.
Every scenario references tools you already have from the course: the hook bus (`swarm/hooks/bus.py`), the cost tracker (`swarm/hooks/cost_hook.py`), the eval harness (`swarm/eval/harness.py`), the memory store (`swarm/memory/store.py`), the audit log written by `swarm/hooks/audit_hook.py`, and the OpenTelemetry wiring in `swarm/observability/telemetry.py`. You do not need to install anything new. If any of these feels unfamiliar, re-read the chapter it came from before you run the steps.
A triage checklist is at the end. If you only have two minutes, start there.
## Scenario 1: Agent cost suddenly jumped 10x
**Symptom.** The billing alert fired. Yesterday's run cost $12. Today's run cost $120. No deploy went out. Nothing obvious changed.
**Triage.**
- Open the cost breakdown. Two tools will tell you where the money went. The `CostTracker` behind `swarm.hooks.cost_hook.get_cost_summary()` has per-model and per-role totals. The raw `CallRecord` entries in `swarm.core.records.COST_LOG` have per-agent, per-task detail. Start with the summary, then drill down.
```python
from swarm.hooks.cost_hook import get_cost_summary

summary = get_cost_summary()
for model, stats in sorted(
    summary["by_model"].items(),
    key=lambda kv: kv[1]["cost_usd"],
    reverse=True,
):
    print(f"{model:40s} ${stats['cost_usd']:>10.4f} ({stats['calls']} calls)")
```
- If one model is spiking, look at routing. `swarm/routing/router.py` uses the triage agent to classify tasks and return a `TaskRoute`. Check whether the triage agent started misclassifying low-complexity tasks as "high" and sending them to Opus. Grep the audit log for recent `post_agent_call` events with `role=triage` and replay a sample through `route_task()` by hand. A drift in the triage system prompt, a provider-side format change, or a new task category the triage agent has never seen will all show up as a complexity-class shift.
- If one role is spiking, look at `COST_LOG` per-agent. A misbehaving worker will show up as an outlier immediately — one `agent_id` with ten times the cost of its peers.
```python
from collections import defaultdict

from swarm.core.records import COST_LOG

per_agent: dict[str, float] = defaultdict(float)
per_agent_calls: dict[str, int] = defaultdict(int)
for r in COST_LOG:
    per_agent[r.agent_id] += r.cost_usd
    per_agent_calls[r.agent_id] += 1

top10 = sorted(per_agent.items(), key=lambda kv: kv[1], reverse=True)[:10]
for agent_id, cost in top10:
    calls = per_agent_calls[agent_id]
    print(f"{agent_id:30s} ${cost:>8.4f} ({calls} calls, ${cost/calls:.4f}/call)")
```
- If one task is spiking, pull the transcript for that `task_id`. `TranscriptStore.grep()` in `swarm/memory/transcripts.py` will find every line that mentions it. Three patterns show up repeatedly: a loop that never terminates (see Scenario 5), a retry budget that never resets (see `modules/01_raw_call/exercises/02_retry_budget.py`), or a user who pasted a novel into the prompt. The transcript is the fastest way to tell which.
**Prevention.** Budget caps belong on the critical path, not in a dashboard you check weekly. Register a `post_agent_call` handler that reads `get_cost_summary()["total_usd"]` on every call and raises `HookAbortError` at 100% of the daily budget, logging a warning at 80%. The hook bus in `swarm/hooks/bus.py` will propagate the abort up through the loop, and the orchestrator will see a clean failure instead of a silent overrun. For long-running swarms, reset the tracker nightly with `reset_cost_state()` so the budget is per-day rather than per-process.
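A minimal sketch of that guard. It assumes `HookAbortError` is importable from `swarm.hooks.bus` and that the bus exposes an `on(event, handler)` registration method; adjust both to match your `HookBus` API.

```python
import logging

from swarm.hooks.bus import HookBus, HookAbortError  # import location assumed
from swarm.hooks.cost_hook import get_cost_summary

DAILY_BUDGET_USD = 50.0  # illustrative figure
log = logging.getLogger("budget")

def budget_guard(payload: dict) -> None:
    # Runs on every post_agent_call; the summary lives in memory, so this is cheap.
    total = get_cost_summary()["total_usd"]
    if total >= DAILY_BUDGET_USD:
        raise HookAbortError(f"daily budget exhausted: ${total:.2f}")
    if total >= 0.8 * DAILY_BUDGET_USD:
        log.warning("cost at %.0f%% of daily budget ($%.2f)",
                    100 * total / DAILY_BUDGET_USD, total)

bus = HookBus()
bus.on("post_agent_call", budget_guard)  # `on()` is an assumed registration name
```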
## Scenario 2: Agent accuracy dropped after a provider change
**Symptom.** You switched the worker role from Sonnet to Haiku to save money. The eval harness pass rate dropped from 85% to 62%. The cost dropped too, but not enough to justify the regression.
**Triage.**
- Re-run the eval on both models explicitly. `EvalHarness` in `swarm/eval/harness.py` takes a model argument on `.run()`, so run twice and compare.
```python
from swarm.eval.harness import EvalHarness

harness = EvalHarness(cases=my_cases, judge=my_judge)
run_sonnet = await harness.run(model="claude-sonnet-4-6", system=SYSTEM)
run_haiku = await harness.run(model="claude-haiku-4-5-20251001", system=SYSTEM)
print(harness.compare(run_sonnet, run_haiku))
```
`compare()` returns `score_delta`, `pass_rate_a` and `pass_rate_b`, `cost_delta_usd`, and `latency_delta_ms`. Use those numbers, not vibes.
- Look at the per-case breakdown. `EvalRun.results` is a list of `EvalResult`; each has `case_id`, `passed`, `score`, and `output`. Filter to cases that passed on Sonnet and failed on Haiku. Do not look at the aggregate pass rate until you have read ten of the regressions end to end.
- Classify the regressions. In practice they fall into four buckets: tool-call failures (the model emitted malformed JSON in `tool_use` blocks), wrong format (correct content, wrong structure), hallucination (plausible but wrong), and truncation (ran out of `max_tokens` before finishing). Each bucket has a different fix. Tag the failures manually on your first pass; the pattern will be obvious after ten cases.
- Check the system prompt. Prompts written against a stronger model often carry implicit assumptions — "think step by step before answering" assumes the model can hold the chain. Haiku's context is shorter and its reasoning depth is lower. A prompt that says "consider all relevant factors" will get a deeper answer from Sonnet than from Haiku, and that gap shows up as regressions on open-ended cases.
- Do a Pareto analysis. Compute cost-per-passed-case, not cost-per-call. `total_cost_usd / passed` is what matters. If Haiku costs one-fifth as much but gets two-thirds the pass rate, the cost per correct answer is still lower — but only if your workload can tolerate the wrong answers. For most production workloads it cannot. A helper for this follows the list.
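The arithmetic is worth automating. A small helper, assuming `EvalRun` exposes the `total_cost_usd` and `results` fields used above:

```python
def cost_per_pass(run) -> float:
    # Dollars per correct answer: the number that should decide a model swap.
    passed = sum(1 for r in run.results if r.passed)
    return run.total_cost_usd / passed if passed else float("inf")

print(f"sonnet: ${cost_per_pass(run_sonnet):.4f} per passed case")
print(f"haiku:  ${cost_per_pass(run_haiku):.4f} per passed case")
```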
**Prevention.** Pin the provider and model explicitly in your `SwarmConfig.model_overrides`. When you want to try a new model, use a canary: route 1% of traffic to the new model, run the eval harness on the canary output nightly, and only promote when the pass rate and cost are both acceptable. Never swap a role's model based on price alone.
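A sketch of what pinning might look like; the `SwarmConfig` import path is an assumption, and the role keys mirror the ones used in this appendix:

```python
from swarm.core.config import SwarmConfig  # import path assumed

config = SwarmConfig(
    model_overrides={
        "orchestrator": "claude-sonnet-4-6",
        "triage": "claude-haiku-4-5-20251001",
        "worker": "claude-haiku-4-5-20251001",  # promote only after the canary passes
    },
)
```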
## Scenario 3: Memory is corrupting
**Symptom.** The agent is confidently returning outdated facts. A user reports "you said my plan was Pro but I upgraded to Enterprise last month." You check the transcripts and see the upgrade was logged correctly yesterday. Today's answer references a stale topic file.
**Triage.**
- Check the index. `MemoryIndex` in `swarm/memory/index.py` is the top-level pointer list at `memory/MEMORY.md`. If the index is stale — it still points at a topic that was superseded, or it omits a topic that exists on disk — the agent will read the stale pointer and miss the fresh topic. Compare `index.all_topics()` against the actual `.md` files in the memory dir.
```python
from pathlib import Path

from swarm.memory.store import MemoryStore

store = MemoryStore("./memory")
indexed = set(store.index.all_topics())
on_disk = {p.name for p in Path("./memory").glob("*.md") if p.name != "MEMORY.md"}

print("orphan topic files:", on_disk - indexed)
print("dangling index pointers:", indexed - on_disk)
for t in indexed & on_disk:
    topic = store.read_topic(t)
    print(f"{t}: {len(topic.body)} chars")
```
- Check for concurrent writes. Two daemons sharing one memory dir will both try to run `maybe_dream()`, both try to take the advisory lock at `memory/dream.lock`, and race on the topic files. The lock file contains a JSON blob with `pid` and `ts`; if you see a PID that is not yours and the timestamp is fresh, another process is writing. If the PID is gone but the file is there, `acquire_lock()` in `swarm/memory/dream.py` will clean it up after five minutes — but a corrupted write from a crash mid-consolidation can leave the topic file half-written in the interim.
- Check when autoDream last ran. `memory/dream_state.json` has a `last_dream` ISO timestamp. If it is more than 24 hours old, consolidation has not fired and your transcripts are full of facts that never got promoted to the topic layer. The triple gate in `should_dream()` wants 5 sessions and 24 hours elapsed and no lock held; if any gate fails, nothing happens. Print the gate state to find out which; a sketch follows this triage list.
- Walk the transcripts. Iterate the JSONL files under `memory/transcripts/`. A crashed write will leave a truncated final line that fails `json.loads()`. `_entry_from_line()` in `swarm/memory/transcripts.py` silently returns `None` on parse failures, which means corrupted entries are skipped rather than visible — you have to walk the raw lines yourself.
```python
import json
from pathlib import Path

bad = []
for p in Path("./memory/transcripts").glob("*.jsonl"):
    for i, line in enumerate(p.read_text().splitlines()):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append((p.name, i, line[:80]))

for name, i, preview in bad:
    print(f"{name}:{i} {preview}")
```
- Check for size leak. `du -sh memory/` should be roughly stable day over day once the swarm has warmed up. If it is growing unbounded, consolidation is firing but `topics.delete()` is not. Look at the Dream LLM's `delete` list in recent runs (the summary is in `dream_state.json["last_summary"]`) — if it is always empty, the prompt is not surfacing stale entries.
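A quick script to print the gate state. `last_dream` comes from the description above; the session-counter key name is a guess you should adjust to your `dream_state.json`:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

state = json.loads(Path("./memory/dream_state.json").read_text())
last = datetime.fromisoformat(state["last_dream"])
if last.tzinfo is None:  # tolerate naive timestamps
    last = last.replace(tzinfo=timezone.utc)
age_h = (datetime.now(timezone.utc) - last).total_seconds() / 3600

print(f"hours since last dream: {age_h:.1f} (gate: >= 24)")
print(f"sessions since last:    {state.get('session_count', '?')} (gate: >= 5; key name assumed)")
print(f"lock held:              {Path('./memory/dream.lock').exists()} (gate: must be free)")
```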
**Prevention.** One daemon per memory dir, always. Use a supervisor (systemd, launchd, Docker) that refuses to start a second instance. The advisory lock exists to cover rare races, not to enable parallel writers. Schedule consolidation on a cron or a bus event rather than relying on the organic session counter — for low-traffic deployments the session counter will never cross the gate.
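If the supervisor cannot enforce single-instance for you, a process-level guard is cheap. A POSIX-only sketch using an OS file lock, separate from the advisory `dream.lock`:

```python
import fcntl
import os
import sys

def ensure_single_instance(memory_dir: str):
    # flock is released by the OS when the process exits, even after a crash.
    lock = open(os.path.join(memory_dir, ".daemon.lock"), "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(f"another daemon already owns {memory_dir}")
    return lock  # keep the handle alive for the daemon's lifetime

daemon_lock = ensure_single_instance("./memory")
```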
## Scenario 4: Safety rules are being bypassed
**Symptom.** Your constitution has a rule that says "never mention competitor X". The agent just mentioned competitor X in a user-facing response. Support forwards you the transcript.
**Triage.**
- Check the audit log. `make_audit_hook_for_event` in `swarm/hooks/audit_hook.py` writes one line per hook event to `./logs/audit.jsonl`. Each entry has the event name, timestamp, and `agent_id`. For the `agent_id` in question, pull every event around the timestamp of the violation and look for `security_block` and `security_allow` events — those are the signals that the constitution check fired.
```python
import json
from pathlib import Path

target_agent = "worker_7"
for line in Path("./logs/audit.jsonl").read_text().splitlines():
    entry = json.loads(line)
    if entry.get("agent_id") == target_agent and entry["event"] in {
        "security_block", "security_allow", "post_agent_call",
    }:
        print(entry["ts"], entry["event"], entry.get("payload_keys"))
```
The audit hook records only keys, never values, by design — so you will see that a `security_block` fired but not what triggered it. To get the content, you need the transcript. `TranscriptStore.grep()` with the `agent_id` as the pattern is the fastest way.
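For example (the constructor argument and the shape of the results are assumptions; check how your `TranscriptStore` is built):

```python
from swarm.memory.transcripts import TranscriptStore

store = TranscriptStore("./memory/transcripts")  # constructor arg assumed
for hit in store.grep("worker_7"):               # grep() per the text above
    print(hit)
```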
- Check hook ordering. Handlers in `HookBus` run in registration order (sync first, then async, each group sequential). If a later hook mutates the message after the constitution check, the check was useless. Print `bus.handler_count("post_agent_call")` and look at the order handlers were registered. A compaction hook or a formatting hook registered after the security hook can rewrite content the security hook already blessed.
- Check for injection. User prompts can contain phrases from the `DENYLIST` in `swarm/safety/injection.py` — things like "ignore previous instructions", "new task:", "pretend you are". `verify_tool_output()` wraps untrusted text in a `<tool_output_untrusted>` block, but it only runs on tool output, not on the user's original request. If the injection rode in via user input and the model complied, the constitution rule had no chance.
- Check for model drift. The constitution rules in `swarm/safety/constitution.py` are regexes, and they match the agent's action text. A model update can change how the agent phrases things — for example, if the model starts calling `subprocess.run(["rm", "-rf", ...])` instead of `rm -rf`, rule C01's regex `\brm\b.*-[^\s]*r|shutil\.rmtree` will miss it. Check the last deploy log for a provider SDK bump or a model ID change.
- Review the rule's regex. "Never mention competitor X" is often implemented as `re.search(r"competitor X", text, re.IGNORECASE)`. That matches "Competitor X" but not "X, our competitor" or "the other vendor". Walk through the actual offending output, paste it into a quick regex tester, and see which wordings slip through (see the sketch after this list). `check_constitution()` returns the list of rules that fired — zero rules fired means your regex did not cover the surface area you thought it did.
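A coverage check you can run in a REPL, using the rule from the bullet above; the sample wordings are illustrative:

```python
import re

rule = re.compile(r"competitor X", re.IGNORECASE)
samples = [
    "Competitor X launched a similar feature.",  # caught
    "X, our main competitor, ships this too.",   # slips through
    "The other vendor's product does this.",     # slips through
]
for s in samples:
    print(f"{'HIT ' if rule.search(s) else 'MISS'}  {s}")
```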
**Prevention.** For high-stakes rules, add a second layer: a dedicated screening model (Haiku is cheap and fast enough) that reads the final output with a prompt like "Does this response mention competitor X in any form? Answer YES or NO." Call it from a `post_agent_call` hook; abort the chain on YES. This is the dual-LLM defense pattern. Also: run `check_denylist()` on the user's input before dispatch, not only on tool output.
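A sketch of that screen as a hook handler. The payload key, the `call_agent` import path, and its signature are assumptions; wire them to your client:

```python
from swarm.core.client import call_agent  # import path assumed
from swarm.hooks.bus import HookAbortError

SCREEN_SYSTEM = "You are a strict output screen. Answer YES or NO only."
SCREEN_PROMPT = "Does this response mention competitor X in any form?\n\n{text}"

async def competitor_screen(payload: dict) -> None:
    text = payload.get("output", "")  # payload key assumed
    verdict = await call_agent(       # signature assumed; use your own wrapper
        model="claude-haiku-4-5-20251001",
        system=SCREEN_SYSTEM,
        user=SCREEN_PROMPT.format(text=text),
    )
    if verdict.strip().upper().startswith("YES"):
        raise HookAbortError("screen model flagged a competitor mention")
```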
## Scenario 5: Agent loops indefinitely
**Symptom.** A worker task has been running for forty-five minutes. `LoopState.iterations` keeps climbing. The logs show the same tool being called with slight variations — different filenames, different query strings, but the same tool with the same error class coming back.
**Triage.**
- Check the iteration ceiling. `run_loop` in `swarm/core/loop.py` takes `max_iterations` with a default of 10. If a custom loop was written for this worker, check whether the ceiling was bumped to a high number "just for testing" and never lowered. Ten is a reasonable production default; fifty is not.
- Check the tool-error ceiling. `LoopState` only tracks `iterations` and `tool_calls_made` — it does not track consecutive failures. A loop that keeps calling `read_file("/path/that/does/not/exist")` will happily iterate ten times without hitting any explicit failure. Wrap the loop state so you count consecutive errors and break early:
```python
from dataclasses import dataclass

from swarm.core.loop import LoopState

@dataclass
class ErrorAwareLoopState(LoopState):
    consecutive_errors: int = 0
    max_consecutive_errors: int = 3
    last_error_signature: str = ""

    def record_tool_result(self, tool_name: str, result: str) -> None:
        # Identical tool + identical error prefix counts as the same failure.
        signature = f"{tool_name}:{result[:80]}"
        if result.startswith("ERROR") or "error" in result.lower()[:20]:
            if signature == self.last_error_signature:
                self.consecutive_errors += 1
            else:
                self.consecutive_errors = 1
            self.last_error_signature = signature
        else:
            self.consecutive_errors = 0
            self.last_error_signature = ""

    @property
    def should_break(self) -> bool:
        return self.consecutive_errors >= self.max_consecutive_errors
```
- Check the system prompt. Agents loop because they do not know when to stop. A good system prompt has explicit termination rules: "If you cannot complete the task in three tool calls, return the partial result with an explanation." Without a stop rule the model will keep trying — and the loop's `max_iterations` is the only safety net.
- Check the tool registry. A tool that returns a slightly different error every time is a honeypot for infinite loops. If `read_file` returns "file not found at /a/b/c" on one call and "cannot locate /a/b/c" on the next, the agent treats them as different problems and keeps trying. Normalize error messages in the tool implementation (`swarm/tools/registry.py` is where dispatch happens) so identical failures look identical.
- Check network timeouts. A tool that calls out over the network without a timeout will hang for whatever the OS socket default is. The loop does not see a hang as a failure — it sees a long, successful call. Every network tool needs an explicit timeout, and the timeout should be shorter than the outer loop's overall deadline. A minimal timeout wrapper follows this list.
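A minimal illustration with the stdlib; the 10-second figure is arbitrary and should sit well below the loop's wall-clock deadline:

```python
from urllib.request import urlopen

def fetch(url: str, timeout_s: float = 10.0) -> str:
    # urlopen raises on timeout instead of hanging on the OS socket default.
    with urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8", errors="replace")
```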
**Prevention.** Circuit-breaker pattern: kill the loop after K consecutive errors and after T seconds of wall-clock, whichever comes first. Add explicit termination rules to the system prompt ("If the task cannot be completed in N steps, return a partial result and stop."). Emit a `worker_failed` event from the break branch so the orchestrator can route the task elsewhere or escalate to a human.
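A freestanding sketch of that breaker, independent of `LoopState`. Call `record()` after every tool result and check `tripped` at the top of each iteration; when it fires, emit `worker_failed` and return the partial result:

```python
import time

class CircuitBreaker:
    """Trip after K consecutive errors or T seconds of wall-clock, whichever is first."""

    def __init__(self, max_errors: int = 3, max_seconds: float = 300.0) -> None:
        self.max_errors = max_errors
        self.deadline = time.monotonic() + max_seconds
        self.errors = 0

    def record(self, ok: bool) -> None:
        # Consecutive failures count; any success resets the streak.
        self.errors = 0 if ok else self.errors + 1

    @property
    def tripped(self) -> bool:
        return self.errors >= self.max_errors or time.monotonic() >= self.deadline
```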
## Closing: triage checklist
Bookmark this. When something breaks in production, read it before you open any code.
- Did I check `./logs/audit.jsonl` for the `agent_id` and timeframe in question?
- Did I rank-order the top-10 cost contributors via `COST_LOG`?
- Did I run `EvalHarness.compare()` on the model before and after the change I suspect?
- Did I verify memory integrity — index consistent, no orphans, no parse errors in JSONL transcripts?
- Did I check for `DENYLIST` patterns in the user's input, not only tool output?
- Did I inspect the hook registration order to confirm the security hook runs last?
- Did I confirm `max_iterations` is set and sane for every custom loop?
- Did I check for tools without network timeouts?
- Did I pull the raw transcript for the offending `task_id` and read it end to end?
- Did I print `get_cost_summary()` during the problem run, not only after?
The checklist assumes you have observability wired up before the incident. If you do not, wire it up now. `setup_telemetry()` in `swarm/observability/telemetry.py` takes a service name and configures OpenTelemetry against either an OTLP endpoint (set `OTEL_EXPORTER_OTLP_ENDPOINT`) or a console exporter as a fallback. Every `call_agent` produces a `gen_ai.*` span via `record_agent_span()`, and every hook event produces a `hook.{event}` span. Trace IDs propagate through the hook bus via the `_trace_id` payload key, so cross-agent emissions share a trace.
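Wiring it is one call, per the description above:

```python
from swarm.observability.telemetry import setup_telemetry

# Falls back to a console exporter when OTEL_EXPORTER_OTLP_ENDPOINT is unset.
setup_telemetry("swarm-prod")  # service name is illustrative
```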
Ship traces to a collector you can query. Grafana Tempo, Honeycomb, Datadog, Jaeger — any of them will let you filter by `swarm.task_id` and see the full fan-out of a single user request across orchestrator, workers, and hooks. Without that, you are reading audit lines with grep and hoping you picked the right timestamp window. With it, the five scenarios above become five saved queries.
The audit log, the cost log, and the trace exporter are cheap. Turn them on before you need them.