# Appendix F: Production Hardening
The book teaches you what an agent system IS. This appendix teaches you what it takes to run one in production at a company with real customers, real money, and real regulations. The gap is mostly operational, not conceptual. If you can read the daemon code from Chapter 8 and the hook bus from Chapter 7, you can ship the patterns here.
## 1. Multi-tenant cost isolation
The CostTracker in swarm/hooks/cost_hook.py attributes cost per run. That is fine for single-user development. It is not fine for a SaaS. One abusive user running a runaway loop can blow the whole account's monthly budget before lunchtime on a Tuesday.
You want per-tenant budgets with kill switches. The swarm package ships CostGovernor (see swarm/hooks/cost_governor.py) for exactly this:
```python
from swarm.hooks.cost_governor import CostGovernor, CostBudget

budgets = {
    "acme-corp": CostBudget(user_id="acme-corp", limit_usd=50.00, period="monthly"),
    "widget-inc": CostBudget(user_id="widget-inc", limit_usd=10.00, period="daily"),
    "trial-user": CostBudget(user_id="trial-user", limit_usd=0.50, period="per_run"),
}

governor = CostGovernor(budgets)
bus.on("post_agent_call", governor.hook_handler)
```
Callers tag each agent call with user_id in the payload. The governor reads it, attributes the cost, and raises BudgetExhausted when a user hits their cap. If you set hard_kill=False, the governor logs the violation without interrupting; use that when you want observability without angering paying customers who are a few cents over.
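To make the mechanics concrete, here is a minimal stand-in for the budget check (the `Budget` dataclass and `charge` function are hypothetical names for illustration; the shipped CostGovernor is the real implementation):

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    pass


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0


def charge(budgets: dict[str, Budget], user_id: str, cost_usd: float,
           hard_kill: bool = True) -> None:
    """Attribute cost to a tenant; raise (or merely log) once the cap is hit."""
    b = budgets[user_id]
    b.spent_usd += cost_usd
    if b.spent_usd > b.limit_usd:
        if hard_kill:
            raise BudgetExhausted(f"{user_id} exceeded {b.limit_usd:.2f} USD")
        print(f"[budget] soft violation: {user_id}")  # log-only path
```

The key property is that attribution happens on every call, so a runaway loop is stopped mid-run rather than discovered on the invoice.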
Chargeback attribution. The ./logs/cost_governor.jsonl file is an append-only record of every cost event. Each line has user_id, cost_usd, model, run_id, and timestamp. Bill customers by grouping on user_id and summing cost_usd per period. This is the simplest chargeback pipeline you can build and it is good enough until you outgrow it.
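The aggregation itself is a few lines; this sketch assumes each JSONL line carries the fields listed above:

```python
import json
from collections import defaultdict


def chargeback(lines) -> dict[str, float]:
    """Sum cost_usd per user_id from cost_governor.jsonl lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in lines:
        event = json.loads(line)
        totals[event["user_id"]] += event["cost_usd"]
    return dict(totals)
```

Usage: `chargeback(open("./logs/cost_governor.jsonl"))`, optionally filtering lines by timestamp first to bill a specific period.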
When to use which kill policy. Hard kill for trial users and per-run budgets: a single abusive run should not cost real money. Soft kill (log only) for long-standing paid customers: an alert to your finance team beats a frustrated CEO on the phone. Both policies can coexist in the same CostGovernor instance; budgets are independent.
## 2. Secrets management
A .env file is fine for one developer. It is not fine for a team, and it is absolutely not fine for production. You need:
- A secrets backend (Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, Doppler).
- A loader in the daemon that fetches at startup and refreshes on a schedule.
- Key rotation without restarting the daemon.
Sketch of the pattern, using a Vault-style client interface:
```python
import asyncio

from swarm.daemon.kairos import KairosDaemon


class SecretsLoader:
    def __init__(self, client, refresh_s: int = 3600):
        self.client = client
        self.refresh_s = refresh_s
        self._keys: dict[str, str] = {}

    async def start(self):
        await self._reload()
        # Keep a reference so the task is not garbage-collected mid-loop.
        self._refresh_task = asyncio.create_task(self._refresh_loop())

    async def _reload(self):
        self._keys["ANTHROPIC_API_KEY"] = await self.client.get("anthropic/prod")
        self._keys["OPENAI_API_KEY"] = await self.client.get("openai/prod")
        # ... more keys

    async def _refresh_loop(self):
        while True:
            await asyncio.sleep(self.refresh_s)
            await self._reload()

    def get(self, name: str) -> str:
        return self._keys[name]
```
Key rotation without restart. The refresh loop pulls new values on a cadence. When you rotate a key in Vault, the daemon picks it up within refresh_s seconds. No deploy, no downtime. Pair this with a brief grace window on the old key at the backend so in-flight calls do not fail.
Per-worker key isolation. Some workers call Anthropic, some call OpenAI. If one key leaks, you want the blast radius limited. Scope keys per worker class: anthropic-worker-pool has the Anthropic key in its env and cannot see the OpenAI key. Kubernetes secrets with RBAC on the ServiceAccount per pool make this straightforward.
## 3. Multi-region failover
In production, you get paged at 3 a.m. when api.anthropic.com has a regional outage. Do not build a system that relies on one region being up.
Circuit breaker pattern. Track recent failures per provider-region pair. After N failures within a window, open the circuit: redirect all traffic to the fallback, no calls to the dead region. Probe every M seconds with one request; close the circuit on success. The industry reference is Hystrix; you can write a 50-line version in Python that is good enough.
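A minimal sketch of such a breaker, with an injectable clock so it is testable (the `CircuitBreaker` name and parameters are mine, not from the swarm package):

```python
import time


class CircuitBreaker:
    """Open after `threshold` failures inside `window_s`; probe after `cooldown_s`."""

    def __init__(self, threshold: int = 5, window_s: float = 60.0,
                 cooldown_s: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def record_failure(self) -> None:
        now = self.clock()
        # Drop failures that have aged out of the window, then count this one.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self) -> None:
        self.failures.clear()
        self.opened_at = None  # close the circuit

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal traffic
        # open: permit probe traffic once the cooldown has elapsed
        return self.clock() - self.opened_at >= self.cooldown_s
```

Wrap each provider-region call in `allow()` / `record_failure()` / `record_success()` and route to the fallback whenever `allow()` returns False.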
Regional routing. Keep a static list of preferred regions per provider. Prefer US-East for Anthropic if your users are in North America; fall back to EU-West on 5xx; fall back to the cached mock responses if both are down and you need to keep serving degraded but non-broken answers.
Latency-aware routing. Beyond failover, pick the fastest-responding region for each worker pool. A sliding window of the last 100 observed latencies per region suffices: compute the p99 over the window and route to the region with the lowest value. This is cheap, and it ships real user-visible wins on transcontinental traffic.
Short sketch:
```python
import time


class RegionalRouter:
    def __init__(self, regions: list[str]):
        self.regions = regions
        self.broken: dict[str, float] = {}  # region -> when-to-retry epoch
        self.p99: dict[str, float] = {r: 0.0 for r in regions}

    def mark_broken(self, region: str, cooldown_s: float = 30.0) -> None:
        self.broken[region] = time.monotonic() + cooldown_s

    def pick(self) -> str:
        now = time.monotonic()
        live = [r for r in self.regions if self.broken.get(r, 0) < now]
        if not live:
            return self.regions[0]  # all broken; try the primary anyway
        return min(live, key=lambda r: self.p99[r])
```
## 4. Kubernetes deployment
The daemon in Chapter 8 is a long-lived process. Kubernetes is where most production agent teams run long-lived processes. The Helm chart outline:
```
chart/
  Chart.yaml
  values.yaml            # image, replicas, budget, hook config
  templates/
    deployment.yaml      # the daemon pod
    service.yaml         # for health checks
    configmap.yaml       # non-secret config
    secret.yaml          # references external secret manager
    hpa.yaml             # horizontal pod autoscaler for workers
    serviceaccount.yaml  # RBAC
```
Liveness probe. Hit a /health endpoint that fails after three consecutive tick failures. In the daemon, wire a hook on swarm_tick_complete that writes a timestamp to /tmp/daemon_last_tick. The probe checks that the timestamp is less than 60 seconds stale. Three consecutive tick failures means the daemon is wedged and Kubernetes should restart it.
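A sketch of the tick-file plumbing (the `/tmp/daemon_last_tick` path is from the probe description above; the function names are mine):

```python
import time

TICK_FILE = "/tmp/daemon_last_tick"


def on_tick_complete(event: dict, path: str = TICK_FILE) -> None:
    """Hook for swarm_tick_complete: persist the wall-clock time of the tick."""
    with open(path, "w") as f:
        f.write(str(time.time()))


def healthy(max_stale_s: float = 60.0, path: str = TICK_FILE) -> bool:
    """The /health endpoint's answer: was the last tick fresh enough?"""
    try:
        with open(path) as f:
            last = float(f.read())
    except (FileNotFoundError, ValueError):
        return False
    return time.time() - last < max_stale_s
```

The /health endpoint simply returns 200 when `healthy()` is True and 500 otherwise; Kubernetes does the rest.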
Readiness probe. Stricter than liveness. The pod is ready only when: database connection succeeds, at least one API key is valid (one test call returns 200), hook bus has registered all expected handlers. A pod that is live but not ready does not receive traffic.
HPA for worker pods. Scale worker replicas based on queue depth, not CPU. CPU is misleading because workers are I/O bound (waiting on LLM responses). Expose swarm_queue_depth as a metric and scale on it. Prometheus Adapter reads the metric; the HPA scales replicas from 2 to 20.
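The gauge itself is trivial; this sketch renders the Prometheus text exposition payload a worker could serve on a /metrics endpoint (in practice you would likely reach for prometheus_client rather than hand-rolling it):

```python
def render_queue_depth(depth: int) -> str:
    """Render swarm_queue_depth as a Prometheus gauge in text exposition format."""
    return (
        "# HELP swarm_queue_depth Pending tasks across worker queues\n"
        "# TYPE swarm_queue_depth gauge\n"
        f"swarm_queue_depth {depth}\n"
    )
```

Prometheus scrapes this, the Prometheus Adapter surfaces it to the HPA, and the HPA scales on it.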
SIGTERM grace period. Kubernetes gives pods a 30-second grace period on terminate by default. The daemon must stop accepting new work immediately and finish in-flight tasks in that window. Chapter 8's shutdown handler already does this; what you add is terminationGracePeriodSeconds: 45 in the pod spec so Kubernetes waits long enough.
Log aggregation. Do not log to disk in a pod. The pod gets replaced; the logs vanish. Log to stdout/stderr as JSON lines and let Fluent Bit, Vector, or the cloud's native collector ship them to Datadog, Loki, Splunk, or CloudWatch. The daemon's ./logs/kairos.jsonl file is a development convenience; in production it is a stdout stream.
## 5. Incident response automation
Agents fail in ways that no single stack trace captures. You need detectors that aggregate across events.
Patterns to detect:
- Same error code N times in a sliding window (repeat errors)
- Cost rate exceeding M times baseline (cost spike)
- p99 latency exceeding a threshold for K consecutive ticks (latency regression)
- A specific user's call volume spiking 10x (abuse or bug)
Wire an IncidentMonitor as a hook-bus consumer that subscribes to every event type and runs pattern matchers in the background. When a pattern fires, it publishes an incident_detected event with a type, severity, and payload.
Escalation paths. Pluggable adapter: one IncidentMonitor, one event, many adapters. Slack for low severity, PagerDuty for high, email for everyone. The decision is config, not code:
```yaml
escalations:
  repeat_error:
    severity: low
    adapters: [slack]
  cost_spike:
    severity: medium
    adapters: [slack, email]
  latency_regression_p99:
    severity: high
    adapters: [pagerduty, slack]
```
Runbook auto-generation. Given an incident type, emit suggested remediation commands. For a repeat-error incident: "grep ./logs/kairos.jsonl for user_id X; disable hook Y; restart worker pool Z." The runbook is a template; fill in the variables from the event payload. Operators love this because the first 30 seconds of an incident is always "where do I start?"
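A sketch of the template approach, with hypothetical placeholder names drawn from the example above:

```python
RUNBOOKS = {
    "repeat_error": (
        "1. grep ./logs/kairos.jsonl for user_id {user_id}\n"
        "2. disable hook {hook}\n"
        "3. restart worker pool {pool}"
    ),
}


def render_runbook(incident_type: str, payload: dict) -> str:
    """Fill the runbook template with variables from the incident payload."""
    return RUNBOOKS[incident_type].format(**payload)
```

Attach the rendered runbook to the escalation message so the operator's first 30 seconds are spent acting, not searching.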
## 6. PII handling and GDPR
Agents log everything by default. Transcripts contain user emails, phone numbers, and credit card fragments. You need to handle this before a regulator calls.
Log sanitization. Before any write, run the content through a sanitizer that strips common PII:
```python
import re

PII_PATTERNS = [
    (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
    (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]"),
    (r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"),
    (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]"),
]


def sanitize(text: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text
```
Wire this into the audit hook (swarm/hooks/audit_hook.py) so it runs on every log write. Preserve the original in a separate encrypted store if you need it for debugging; sanitize before disk or network.
Retention policy. Transcripts older than N days get deleted. Configure per-tenant: some contracts require 90 days, some require 7. A background job walks ./memory/transcripts/ once a day, checking file timestamps against the per-tenant policy.
Right-to-delete. A user asks you to delete their data. You need an API that walks all layers (transcripts, index, topics, consolidated memories) and removes anything tagged with that user_id. Keep a deletion log for regulatory proof.
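A sketch of the walker over in-memory stand-ins for those layers (real stores would get store-specific delete calls, but the shape and the deletion log are the same; the function name is mine):

```python
import time


def delete_user_data(user_id: str, layers: dict[str, list[dict]],
                     deletion_log: list[dict]) -> int:
    """Remove every record tagged with user_id across all memory layers.

    `layers` stands in for transcripts, index, topics, and consolidated
    memories; each value is mutated in place."""
    removed = 0
    for records in layers.values():
        kept = [r for r in records if r.get("user_id") != user_id]
        removed += len(records) - len(kept)
        records[:] = kept
    # Append to the deletion log: this is your regulatory proof.
    deletion_log.append({"user_id": user_id, "removed": removed,
                         "timestamp": time.time()})
    return removed
```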
Audit trail. Every data access (who called which tool, who read which memory topic) needs a log entry with actor, resource, action, timestamp. This is what satisfies auditors and what you use internally when a breach is suspected.
## 7. Operational runbook checklist
Before going live, walk through this list. If any item is unchecked, either check it or write a decision record explaining why not.
- CostGovernor configured with budgets per tenant
- Secrets loaded from a real backend, not .env
- Key rotation schedule documented and tested (actually rotate a key in staging)
- At least two regions configured with failover probed regularly
- Liveness and readiness probes on every pod
- HPA on the worker pool with queue-depth metric
- SIGTERM grace period long enough for the longest in-flight task
- All logs ship to a central aggregator (Datadog, Loki, Splunk, CloudWatch)
- Cost dashboard visible to the team; alert fires at 80% of budget
- Incident monitor configured with at least three patterns (repeat errors, cost spike, latency regression)
- Escalation paths routed (Slack, PagerDuty, email) with on-call rotation
- PII sanitizer wired into the audit hook
- Retention policy enforced by a daily job
- Right-to-delete API endpoint and test
- Audit trail written for every tool call and memory access
- Runbook written for the three most likely incidents
- Chaos test: kill one pod at random, verify recovery
- Load test: 10x expected peak traffic, verify no cost governor surprises
This list is not exhaustive. It is the floor. Every operational surprise you have in production will be because of something not on this list; treat that as input to expand it.