
# Appendix: Debugging and Instrumentation Playbook

Agents fail in ways that no stack trace captures. An rm -rf fails loudly, with a line number. A worker that cheerfully loops for forty-five minutes, burning twelve dollars on the same malformed tool call, fails quietly, and the logs show ten thousand lines that all look fine individually. You need a different kind of runbook for these.

This appendix is organized around symptoms, not causes. You saw something in production — a billing alert, a regression on the eval harness, an output that violates a rule you thought you had locked down. You open the matching section. It walks you to the root cause without asking you to first guess which subsystem is broken.

Every scenario references tools you already have from the course: the hook bus (swarm/hooks/bus.py), the cost tracker (swarm/hooks/cost_hook.py), the eval harness (swarm/eval/harness.py), the memory store (swarm/memory/store.py), the audit log written by swarm/hooks/audit_hook.py, and the OpenTelemetry wiring in swarm/observability/telemetry.py. You do not need to install anything new. If any of these feels unfamiliar, re-read the chapter it came from before you run the steps.

A triage checklist is at the end. If you only have two minutes, start there.


## Scenario 1: Agent cost suddenly jumped 10x

Symptom. The billing alert fired. Yesterday's run cost $12. Today's run cost $120. No deploy went out. Nothing obvious changed.

Triage.

  1. Open the cost breakdown. Two data sources will tell you where the money went. The CostTracker summary returned by swarm.hooks.cost_hook.get_cost_summary() has per-model and per-role totals. The raw CallRecord entries in swarm.core.records.COST_LOG have per-agent, per-task detail. Start with the summary, then drill down.

```python
from swarm.hooks.cost_hook import get_cost_summary

summary = get_cost_summary()
for model, stats in sorted(
    summary["by_model"].items(),
    key=lambda kv: kv[1]["cost_usd"],
    reverse=True,
):
    print(f"{model:40s} ${stats['cost_usd']:>10.4f}  ({stats['calls']} calls)")
```

  2. If one model is spiking, look at routing. swarm/routing/router.py uses the triage agent to classify tasks and return a TaskRoute. Check whether the triage agent started misclassifying low-complexity tasks as "high" and sending them to Opus. Grep the audit log for recent post_agent_call events with role=triage and replay a sample through route_task() by hand. A drift in the triage system prompt, a provider-side format change, or a new task category the triage agent has never seen will all show up as a complexity-class shift.

  3. If one role is spiking, look at COST_LOG per-agent. A misbehaving worker will show up as an outlier immediately — one agent_id with ten times the cost of its peers.

```python
from collections import defaultdict
from swarm.core.records import COST_LOG

per_agent: dict[str, float] = defaultdict(float)
per_agent_calls: dict[str, int] = defaultdict(int)
for r in COST_LOG:
    per_agent[r.agent_id] += r.cost_usd
    per_agent_calls[r.agent_id] += 1
top10 = sorted(per_agent.items(), key=lambda kv: kv[1], reverse=True)[:10]
for agent_id, cost in top10:
    calls = per_agent_calls[agent_id]
    print(f"{agent_id:30s} ${cost:>8.4f}  ({calls} calls, ${cost/calls:.4f}/call)")
```

  4. If one task is spiking, pull the transcript for that task_id. TranscriptStore.grep() in swarm/memory/transcripts.py will find every line that mentions it. Three patterns show up repeatedly: a loop that never terminates (see Scenario 5), a retry budget that never resets (see modules/01_raw_call/exercises/02_retry_budget.py), or a user who pasted a novel into the prompt. The transcript is the fastest way to tell which.
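
If TranscriptStore.grep() is not close at hand, the same scan takes a few lines of stdlib. A minimal sketch, assuming the transcripts are JSONL files (one entry per line) as described in Scenario 3; grep_transcripts is a hypothetical helper, not part of the course code:

```python
import json
from pathlib import Path


def grep_transcripts(root: str, task_id: str) -> list[dict]:
    """Collect every parseable transcript entry that mentions task_id."""
    hits: list[dict] = []
    for path in Path(root).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if task_id not in line:
                continue
            try:
                hits.append(json.loads(line))
            except json.JSONDecodeError:
                # Truncated or corrupt line; see Scenario 3 for a full integrity walk.
                continue
    return hits
```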

Prevention. Budget caps belong on the critical path, not in a dashboard you check weekly. Register a post_agent_call handler that reads get_cost_summary()["total_usd"] on every call and raises HookAbortError at 100% of the daily budget, logging a warning at 80%. The hook bus in swarm/hooks/bus.py will propagate the abort up through the loop, and the orchestrator will see a clean failure instead of a silent overrun. For long-running swarms, reset the tracker nightly with reset_cost_state() so the budget is per-day rather than per-process.
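
The gate logic itself is small. A sketch of just the threshold check, with the course's HookAbortError stubbed locally as an assumption; the real handler would read get_cost_summary() on every call and be registered on the hook bus as described above:

```python
from typing import Optional


class HookAbortError(RuntimeError):
    """Local stand-in for the course's hook-bus abort exception."""


def check_budget(summary: dict, daily_budget_usd: float,
                 warn_at: float = 0.8) -> Optional[str]:
    """Warn at 80% of the daily budget; abort the chain at 100%."""
    total = summary["total_usd"]
    if total >= daily_budget_usd:
        raise HookAbortError(
            f"daily budget exceeded: ${total:.2f} >= ${daily_budget_usd:.2f}")
    if total >= warn_at * daily_budget_usd:
        return f"budget warning: ${total:.2f} ({total / daily_budget_usd:.0%} of cap)"
    return None
```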


## Scenario 2: Agent accuracy dropped after a provider change

Symptom. You switched the worker role from Sonnet to Haiku to save money. The eval harness pass rate dropped from 85% to 62%. The cost dropped too, but not enough to justify the regression.

Triage.

  1. Re-run the eval on both models explicitly. EvalHarness in swarm/eval/harness.py takes a model argument on .run(), so run twice and compare.

```python
from swarm.eval.harness import EvalHarness

harness = EvalHarness(cases=my_cases, judge=my_judge)
run_sonnet = await harness.run(model="claude-sonnet-4-6", system=SYSTEM)
run_haiku = await harness.run(model="claude-haiku-4-5-20251001", system=SYSTEM)
print(harness.compare(run_sonnet, run_haiku))
```

compare() returns score_delta, pass_rate_a and pass_rate_b, cost_delta_usd, and latency_delta_ms. Use those numbers, not vibes.

  2. Look at per-case breakdown. EvalRun.results is a list of EvalResult; each has case_id, passed, score, and output. Filter to cases that passed on Sonnet and failed on Haiku. Do not look at the aggregate pass rate until you have read ten of the regressions end to end.

  3. Classify the regressions. In practice they fall into four buckets: tool-call failures (the model emitted malformed JSON for tool_uses), wrong format (correct content, wrong structure), hallucination (plausible but wrong), and truncation (ran out of max_tokens before finishing). Each bucket has a different fix. Tag the failures manually on your first pass; the pattern will be obvious after ten cases.

  4. Check the system prompt. Prompts written against a stronger model often carry implicit assumptions — "think step by step before answering" assumes the model can hold the chain. Haiku's context is shorter and its reasoning depth is lower. A prompt that says "consider all relevant factors" will get a deeper answer from Sonnet than from Haiku, and that gap shows up as regressions on open-ended cases.

  5. Do a Pareto analysis. Compute cost-per-passed-case, not cost-per-call. total_cost_usd / passed is what matters. If Haiku costs one-fifth as much but gets two-thirds the pass rate, the cost per correct answer is still lower — but only if your workload can tolerate the wrong answers. For most production workloads it cannot.
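
The passed-on-Sonnet, failed-on-Haiku filter described above is a set intersection. A sketch using a minimal stand-in for EvalResult (the field names mirror the ones listed above; the real class lives in the course code):

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Minimal stand-in for EvalResult, with the fields described above."""
    case_id: str
    passed: bool
    score: float
    output: str


def regressions(before: list[Result], after: list[Result]) -> list[str]:
    """Case IDs that passed on the old model but failed on the new one."""
    passed_before = {r.case_id for r in before if r.passed}
    failed_after = {r.case_id for r in after if not r.passed}
    return sorted(passed_before & failed_after)
```

Read the cases this returns end to end before touching the aggregate numbers.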

Prevention. Pin the provider and model explicitly in your SwarmConfig.model_overrides. When you want to try a new model, use a canary: route 1% of traffic to the new model, run the eval harness on the canary output nightly, and only promote when the pass rate and cost are both acceptable. Never swap a role's model based on price alone.


## Scenario 3: Memory is corrupting

Symptom. The agent is confidently returning outdated facts. A user reports "you said my plan was Pro but I upgraded to Enterprise last month." You check the transcripts and see the upgrade was logged correctly yesterday. Today's answer references a stale topic file.

Triage.

  1. Check the index. MemoryIndex in swarm/memory/index.py is the top-level pointer list at memory/MEMORY.md. If the index is stale — it still points at a topic that was superseded, or it omits a topic that exists on disk — the agent will read the stale pointer and miss the fresh topic. Compare index.all_topics() against the actual .md files in the memory dir.

```python
from pathlib import Path
from swarm.memory.store import MemoryStore

store = MemoryStore("./memory")
indexed = set(store.index.all_topics())
on_disk = {p.name for p in Path("./memory").glob("*.md") if p.name != "MEMORY.md"}
print("orphan topic files:", on_disk - indexed)
print("dangling index pointers:", indexed - on_disk)
for t in indexed & on_disk:
    topic = store.read_topic(t)
    print(f"{t}: {len(topic.body)} chars")
```

  2. Check for concurrent writes. Two daemons sharing one memory dir will both try to run maybe_dream(), both try to take the advisory lock at memory/dream.lock, and race on the topic files. The lock file contains a JSON blob with pid and ts; if you see a PID that is not yours and the timestamp is fresh, another process is writing. If the PID is gone but the file is there, acquire_lock() in swarm/memory/dream.py will clean it up after five minutes — but a corrupted write from a crash mid-consolidation can leave the topic file half-written in the interim.

  3. Check when autoDream last ran. memory/dream_state.json has a last_dream ISO timestamp. If it is more than 24 hours old, consolidation has not fired and your transcripts are full of facts that never got promoted to the topic layer. The triple-gate in should_dream() wants 5 sessions and 24 hours elapsed and no lock held; if any gate fails, nothing happens. Print the gate state to find out which.

  4. Walk the transcripts. Iterate the JSONL files under memory/transcripts/. A crashed write will leave a truncated final line that fails json.loads(). _entry_from_line() in swarm/memory/transcripts.py silently returns None on parse failures, which means corrupted entries are skipped rather than visible — you have to walk the raw lines yourself.

```python
import json
from pathlib import Path

bad = []
for p in Path("./memory/transcripts").glob("*.jsonl"):
    for i, line in enumerate(p.read_text().splitlines()):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append((p.name, i, line[:80]))
for name, i, preview in bad:
    print(f"{name}:{i}  {preview}")
```

  5. Check for size leak. du -sh memory/ should be roughly stable day over day once the swarm has warmed up. If it is growing unbounded, consolidation is firing but topics.delete() is not. Look at the Dream LLM's delete list in recent runs (the summary is in dream_state.json["last_summary"]) — if it is always empty, the prompt is not surfacing stale entries.
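
Printing the gate state for the dream check above takes one small helper. A sketch assuming the thresholds described in this scenario (5 sessions, 24 hours elapsed, no held lock); the field names inside the real should_dream() may differ:

```python
from datetime import datetime, timezone


def failed_gates(last_dream_iso: str, sessions_since: int, lock_held: bool,
                 min_sessions: int = 5, min_hours: float = 24.0) -> list[str]:
    """Return a description of each gate that would block consolidation."""
    gates = []
    if sessions_since < min_sessions:
        gates.append(f"sessions: {sessions_since} < {min_sessions}")
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_dream_iso)
    age_h = age.total_seconds() / 3600
    if age_h < min_hours:
        gates.append(f"elapsed: {age_h:.1f}h < {min_hours}h")
    if lock_held:
        gates.append("lock: dream.lock is held")
    return gates  # empty list means consolidation should fire
```

Feed it last_dream from dream_state.json and the session counter; an empty result with no dream run points the finger at the scheduler, not the gates.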

Prevention. One daemon per memory dir, always. Use a supervisor (systemd, launchd, Docker) that refuses to start a second instance. The advisory lock exists to cover rare races, not to enable parallel writers. Schedule consolidation on a cron or a bus event rather than relying on the organic session counter — for low-traffic deployments the session counter will never cross the gate.


## Scenario 4: Safety rules are being bypassed

Symptom. Your constitution has a rule that says "never mention competitor X". The agent just mentioned competitor X in a user-facing response. Support forwards you the transcript.

Triage.

  1. Check the audit log. make_audit_hook_for_event in swarm/hooks/audit_hook.py writes one line per hook event to ./logs/audit.jsonl. Each entry has the event name, timestamp, and agent_id. For the agent_id in question, pull every event around the timestamp of the violation and look for security_block and security_allow events — those are the signals that the constitution check fired.

```python
import json
from pathlib import Path

target_agent = "worker_7"
for line in Path("./logs/audit.jsonl").read_text().splitlines():
    entry = json.loads(line)
    if entry.get("agent_id") == target_agent and entry["event"] in {
        "security_block", "security_allow", "post_agent_call",
    }:
        print(entry["ts"], entry["event"], entry.get("payload_keys"))
```

The audit hook records only keys, never values, by design — so you will see that a security_block fired but not what triggered it. To get the content, you need the transcript. TranscriptStore.grep() with the agent_id as the pattern is the fastest way.

  2. Check hook ordering. Handlers in HookBus run in registration order (sync first, then async, each group sequential). If a later hook mutates the message after the constitution check, the check was useless. Print bus.handler_count("post_agent_call") and look at the order handlers were registered. A compaction hook or a formatting hook registered after the security hook can rewrite content the security hook already blessed.

  3. Check for injection. User prompts can contain phrases from the DENYLIST in swarm/safety/injection.py — things like "ignore previous instructions", "new task:", "pretend you are". verify_tool_output() wraps untrusted text in a <tool_output_untrusted> block, but it only runs on tool output, not on the user's original request. If the injection rode in via user input and the model complied, the constitution rule had no chance.

  4. Check for model drift. The constitution rules in swarm/safety/constitution.py are regexes and they match the agent's action text. A model update can change how the agent phrases things — for example, if the model starts calling subprocess.run(["rm", "-rf", ...]) instead of rm -rf, rule C01's regex \brm\b.*-[^\s]*r|shutil\.rmtree will miss it. Check the last deploy log for a provider SDK bump or a model ID change.

  5. Review the rule's regex. "Never mention competitor X" is often implemented as re.search(r"competitor X", text, re.IGNORECASE). That matches "Competitor X" but not "X, our competitor" or "the other vendor". Walk through the actual offending output, paste it into a quick regex tester, and see which wordings slip through. check_constitution() returns the list of rules that fired — zero rules fired means your regex did not cover the surface area you thought it did.
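
The coverage gap is easy to demonstrate with the wordings mentioned above — the narrow regex catches exactly one of the three:

```python
import re

rule = re.compile(r"competitor X", re.IGNORECASE)

wordings = [
    "Competitor X has a similar feature.",  # literal match: caught
    "X, our competitor, ships this too.",   # reversed word order: missed
    "The other vendor handles it.",         # paraphrase: missed entirely
]
caught = [w for w in wordings if rule.search(w)]
```

Two of three violations sail past the rule, which is exactly what a zero-rules-fired audit line looks like.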

Prevention. For high-stakes rules, add a second layer: a dedicated screening model (Haiku is cheap and fast enough) that reads the final output with a prompt like "Does this response mention competitor X in any form? Answer YES or NO." Call it from a post_agent_call hook; abort the chain on YES. This is the dual-LLM defense pattern. Also: run check_denylist() on the user's input before dispatch, not only on tool output.
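
A sketch of that screening layer with the model call abstracted into a callable, so the gate logic is testable without a provider. ScreenBlocked is a hypothetical stand-in for the abort you would raise from the post_agent_call hook:

```python
from typing import Callable

SCREEN_PROMPT = ("Does this response mention competitor X in any form? "
                 "Answer YES or NO.\n\n{output}")


class ScreenBlocked(RuntimeError):
    """Raised to abort the chain when the screener answers YES."""


def screen_output(output: str, ask_model: Callable[[str], str]) -> str:
    """Dual-LLM screen: a cheap second model reads the final output."""
    verdict = ask_model(SCREEN_PROMPT.format(output=output)).strip().upper()
    if verdict.startswith("YES"):
        raise ScreenBlocked("screening model flagged the response")
    return output
```

In production, ask_model would be a Haiku call; in tests, a lambda is enough to exercise both branches.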


## Scenario 5: Agent loops indefinitely

Symptom. A worker task has been running for forty-five minutes. LoopState.iterations keeps climbing. The logs show the same tool being called with slight variations — different filenames, different query strings, but the same tool with the same error class coming back.

Triage.

  1. Check the iteration ceiling. run_loop in swarm/core/loop.py takes max_iterations with a default of 10. If a custom loop was written for this worker, check whether the ceiling was bumped to a high number "just for testing" and never lowered. Ten is a reasonable production default; fifty is not.

  2. Check the tool-error ceiling. LoopState only tracks iterations and tool_calls_made — it does not track consecutive failures. A loop that keeps calling read_file("/path/that/does/not/exist") will happily iterate ten times without hitting any explicit failure. Wrap the loop state so you count consecutive errors and break early:

```python
from dataclasses import dataclass
from swarm.core.loop import LoopState

@dataclass
class ErrorAwareLoopState(LoopState):
    consecutive_errors: int = 0
    max_consecutive_errors: int = 3
    last_error_signature: str = ""

    def record_tool_result(self, tool_name: str, result: str) -> None:
        signature = f"{tool_name}:{result[:80]}"
        if result.startswith("ERROR") or "error" in result.lower()[:20]:
            if signature == self.last_error_signature:
                self.consecutive_errors += 1
            else:
                self.consecutive_errors = 1
            self.last_error_signature = signature
        else:
            self.consecutive_errors = 0
            self.last_error_signature = ""

    @property
    def should_break(self) -> bool:
        return self.consecutive_errors >= self.max_consecutive_errors
```

  3. Check the system prompt. Agents loop because they do not know when to stop. A good system prompt has explicit termination rules: "If you cannot complete the task in three tool calls, return the partial result with an explanation." Without a stop rule the model will keep trying — and the loop's max_iterations is the only safety net.

  4. Check the tool registry. A tool that returns a slightly different error every time is a honeypot for infinite loops. If read_file returns "file not found at /a/b/c" on one call and "cannot locate /a/b/c" on the next, the agent treats them as different problems and keeps trying. Normalize error messages in the tool implementation (swarm/tools/registry.py is where dispatch happens) so identical failures look identical.

  5. Check network timeouts. A tool that calls out over the network without a timeout will hang for whatever the OS socket default is. The loop does not see a hang as a failure — it sees a long, successful call. Every network tool needs an explicit timeout, and the timeout should be shorter than the outer loop's overall deadline.
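
A sketch of the error normalization described above — collapse known phrasings onto one canonical form and strip volatile detail such as paths, so identical failures produce identical signatures. The specific substitutions are illustrative, not from the course code:

```python
import re


def normalize_tool_error(message: str) -> str:
    """Collapse variant error wordings so repeated failures compare equal."""
    msg = message.lower()
    # Map known alternate phrasings onto one canonical prefix (extend per tool).
    msg = msg.replace("cannot locate", "file not found at")
    # Strip volatile detail: filesystem paths and hex addresses.
    msg = re.sub(r"(/[\w.\-]+)+", "<path>", msg)
    msg = re.sub(r"0x[0-9a-f]+", "<addr>", msg)
    return msg.strip()
```

With this in the tool's error path, the two read_file failures above collapse to the same signature and the consecutive-error counter actually counts.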

Prevention. Circuit-breaker pattern: kill the loop after K consecutive errors or after T seconds of wall-clock, whichever comes first. Add explicit termination rules to the system prompt ("If the task cannot be completed in N steps, return a partial result and stop."). Emit a worker_failed event from the break branch so the orchestrator can route the task elsewhere or escalate to a human.
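
The two-condition breaker fits in a few lines; the clock is injectable so the wall-clock branch is testable. This is an illustrative sketch, not the course's implementation:

```python
import time


class CircuitBreaker:
    """Trip after K consecutive errors or T seconds of wall clock, whichever first."""

    def __init__(self, max_errors: int = 3, max_seconds: float = 300.0,
                 clock=time.monotonic):
        self.max_errors = max_errors
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self._errors = 0

    def record(self, ok: bool) -> None:
        # A success resets the streak; a failure extends it.
        self._errors = 0 if ok else self._errors + 1

    @property
    def tripped(self) -> bool:
        return (self._errors >= self.max_errors
                or self._clock() - self._start >= self.max_seconds)
```

Check breaker.tripped at the top of every loop iteration and emit the worker_failed event from the break branch.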


## Closing: triage checklist

Bookmark this. When something breaks in production, read it before you open any code.

  • Did I check ./logs/audit.jsonl for the agent_id and timeframe in question?
  • Did I rank-order the top-10 cost contributors via COST_LOG?
  • Did I run EvalHarness.compare() on the model before and after the change I suspect?
  • Did I verify memory integrity — index consistent, no orphans, no parse errors in JSONL transcripts?
  • Did I check for DENYLIST patterns in the user's input, not only tool output?
  • Did I inspect the hook registration order to confirm the security hook runs last?
  • Did I confirm max_iterations is set and sane for every custom loop?
  • Did I check for tools without network timeouts?
  • Did I pull the raw transcript for the offending task_id and read it end to end?
  • Did I print get_cost_summary() during the problem run, not only after?

The checklist assumes you have observability wired up before the incident. If you do not, wire it up now. setup_telemetry() in swarm/observability/telemetry.py takes a service name and configures OpenTelemetry against either an OTLP endpoint (set OTEL_EXPORTER_OTLP_ENDPOINT) or a console exporter as a fallback. Every call_agent produces a gen_ai.* span via record_agent_span(), and every hook event produces a hook.{event} span. Trace IDs propagate through the hook bus via the _trace_id payload key, so cross-agent emissions share a trace.

Ship traces to a collector you can query. Grafana Tempo, Honeycomb, Datadog, Jaeger — any of them will let you filter by swarm.task_id and see the full fan-out of a single user request across orchestrator, workers, and hooks. Without that, you are reading audit lines with grep and hoping you picked the right timestamp window. With it, the five scenarios above become five saved queries.

The audit log, the cost log, and the trace exporter are cheap. Turn them on before you need them.