Chapter 04: State & Collaboration#
Prerequisites: Chapter 03 (Agent Loop, Tools & MCP)
In this chapter

- Why a single agent has two structural blind spots: it forgets, and it cannot verify its own work
- How a three-layer memory system (episodic, semantic, archival) extends the agent past the context window
- A build-along tutorial: raw JSONL transcripts, a key-value index, and a consolidation cycle (we call this autoDream, after sleep cycles in neuroscience)
- Why a second agent with a different role prompt catches mistakes the first one cannot
- The generator and critic loop, its exit condition, and when it breaks
- Why the "APPROVE" signal from a critic is not a measurement, and what Chapter 05 does about it
1. Two Blind Spots#
Your Chapter 03 agent can run tools. It reads files, executes bash, calls MCP servers. Each session works. Each session also begins from zero.
Tell it "we're refactoring the auth module." Close the process. Start a new one. Ask what you were working on. It has no idea. Every call to client.messages.create() is independent. There is no server-side session, no persistent tape. If the agent needs to know something across calls, you have to put it in the request body, every time.
This is the first blind spot: forgetting. Without persistence, the agent cannot learn from its own traces, cannot recall prior decisions, cannot avoid repeating mistakes. The context window is not storage; it is a working set that empties when the process exits.
The second blind spot is subtler. When an agent reviews its own output, it reads with the same context, the same blind spots, the same unstated assumptions that produced the output in the first place. Ask the model to critique its own code and it will usually approve. Madaan et al. (2023, arXiv:2303.17651) showed that the same model given a separate review prompt catches problems the first pass missed. You do not need a bigger model. You need a second pass with a different role.
Both problems are the same shape. A single agent cannot see past itself. Memory extends it into the past; a second agent with a different role extends it sideways, into a perspective it does not naturally hold. Both are ways of breaking the single-agent bottleneck. Both are cheap enough that you should reach for them before reaching for a larger model.
This chapter builds both. First, a three-layer memory system with a consolidation cycle: episodic transcripts, semantic topic files, and an index that lives in the system prompt. Then, a generator and critic loop that refines output until the critic approves, with a conservative cap to prevent infinite refinement. By the end, your agent persists across sessions and checks its own work through a second pair of eyes. The code is under 400 lines; the concepts carry through to every subsequent chapter.
2. Why Memory#
The context window is not a database. It is a working set of up to roughly 200K tokens, refreshed on every call, priced per token. You cannot dump everything into it. You cannot persist anything out of it. When the process exits, the context is gone.
The pain shows up in three ways. First, conversational amnesia: the agent greets you fresh every session, forgets what project you were working on, re-asks questions it already answered. Second, repeated mistakes: without a record of prior failures, the agent tries the same broken approach twice in the same week, sometimes in the same session. Third, no learning from traces: every successful solve dissipates the moment the session ends, even though the trace itself is the cheapest possible training signal, one you already paid to produce.
The fix is not more context. Doubling the window from 200K to 400K doubles the per-call cost and still does not persist across sessions. At Sonnet pricing of around $3 per million input tokens, filling a 200K window costs $0.60 per call just to load the context, before the model emits anything. Throw the same data in a 400K window and you have paid $1.20 per call, for the same non-persistent scratch space. The economics push the other way: keep the context small, keep the store on disk, load selectively.
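The arithmetic above fits in a one-liner you can reuse when sizing context budgets (the ~$3 per million input tokens figure is the chapter's assumed Sonnet pricing, not a current quote):

```python
def context_cost_usd(window_tokens: int, price_per_mtok: float = 3.00) -> float:
    """Cost of filling the input context once, before any output tokens.

    price_per_mtok is an assumed input price in USD per million tokens.
    """
    return window_tokens / 1_000_000 * price_per_mtok

# context_cost_usd(200_000) is about $0.60 per call;
# context_cost_usd(400_000) is about $1.20 -- double the window, double the bill.
```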
Memory is what the agent carries from one session to the next. It reads at session start, writes during the session, and loads relevant bits back into the next one. The concrete question is: what to read, what to write, what to load back. Answering it well is the job of the next two sections.
3. Three Layers of Memory#
Memory systems for agents separate into three layers by access frequency and lifetime.
graph TD
subgraph HOT ["Hot Layer — Always Loaded"]
IDX["Memory Index<br/>Pointer-only, ≤30KB<br/>Lives in system prompt"]
end
subgraph WARM ["Warm Layer — Loaded on Demand"]
TOP["Topic Files<br/>Semantic facts, ≤25KB each<br/>Loaded when index says relevant"]
end
subgraph COLD ["Cold Layer — Never Loaded Whole"]
TRX["Transcript Log<br/>Episodic JSONL, unbounded<br/>Grep only"]
end
IDX -->|"pointer names topic"| TOP
TOP -->|"references specific turn"| TRX
style HOT fill:#ff9966,color:#000
style WARM fill:#ffcc66,color:#000
style COLD fill:#6699cc,color:#fff
Episodic (cold): raw transcripts of every turn. Append-only, unbounded, grep-searched. This is the long-term audit trail. Nothing here is loaded into context by default. You grep it when you need a specific past exchange, and the grep is line-by-line so memory stays constant even when the file grows to hundreds of megabytes.
Semantic (warm): named topic files containing extracted facts and summaries. Each file is under 25KB (roughly 6,000 tokens), loaded only when the agent's current task references it. This is the curated knowledge base. A topic file is a Markdown document with a small YAML header for metadata (name, description, type, last-updated) and a body that reads like a note you would write to yourself.
Archival (hot): a tiny index, always in the system prompt, pointer-only. Each line is a short name plus a one-sentence description of a topic file. At 200 lines by 150 characters the index stays around 30KB, a fixed cost per call regardless of how much total memory you have. The cap is not a limit, it is a contract: every entry must be a pointer, not a description. If you need more than 150 characters to describe something, it belongs in a topic file, not the index.
The distinction between episodic and semantic comes from Endel Tulving (1972). Episodic memory is specific events ("what happened at 3pm yesterday"). Semantic memory is general facts ("Paris is the capital of France"). When you hear "Paris" you do not replay every trip there; you pull the semantic fact and reach for episodic detail only when context demands it. The three-layer system does the same: the index answers "what do I know about", topic files answer "what is the summary", transcripts answer "show me the exact turn".
The OS analogy tightens this. Packer et al. (2023, MemGPT, arXiv:2310.08560) called an LLM with tiered memory an "operating system for language". The index is RAM, always addressable. Topic files are disk, loaded on demand. Transcripts are tape, scanned rarely. The three-layer system in this chapter is a simplified, filesystem-based MemGPT: no database, no embedding server, standard library only. You can run it on a laptop and grep it by hand, which matters more for debugging than for performance.
The layers solve different problems together, and each one alone fails. If you only had transcripts, every retrieval would scan megabytes of JSONL and the agent would pay the scan cost on every turn. If you only had topic files, they would accumulate contradictions because nothing tracks what replaces what; the agent writes a note in March about the auth module, rewrites it in April, and neither note knows about the other. If you only had an index, it would be shallow descriptions with no detail. All three, plus a consolidation cycle that moves information between them, gives you a memory system that stays small in context and grows on disk.
4. Build-Along: A Three-Layer Memory System#
The rest of this section builds the system in swarm/memory/. Each layer gets a concept, a minimum viable implementation, and a pointer to the full file in the repo. At the end we run autoDream and inspect the output.
4.1 Episodic: An Append-Only Transcript Log#
The episodic layer is a JSONL file, one JSON object per line. Append-only, never rewritten, scanned line-by-line. JSONL is not cosmetic: a crash mid-write to a JSON array leaves the file unparseable (the closing ] is missing, and all prior turns become unreadable), whereas a crash mid-write to JSONL leaves one bad last line that you skip on restart. All previous turns stay intact. This is the same principle as PostgreSQL's Write-Ahead Log and Kafka's log segments (Kleppmann, 2017, ch. 3): append-only logs are crash-consistent by design.
Minimum viable implementation:
import json
from datetime import datetime, timezone
from pathlib import Path
def log_turn(path: Path, agent_id: str, role: str, content: str) -> None:
entry = {
"ts": datetime.now(tz=timezone.utc).isoformat(),
"agent_id": agent_id,
"role": role,
"content": content,
}
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("a", encoding="utf-8") as f:
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
def tail(path: Path, n: int = 100) -> list[dict]:
if not path.exists():
return []
lines = path.read_text(encoding="utf-8").splitlines()
return [json.loads(l) for l in lines[-n:] if l.strip()]
Two operations: log_turn appends, tail reads the last N lines. Notice what is not here. There is no rewrite, no compaction, no deletion. Once a line is written it stays. That is exactly what you want for an audit trail.
The real module at swarm/memory/transcripts.py (lines 36–132) adds three things. First, it shards by date: one JSONL file per day (2026-04-21.jsonl), so grep on recent history does not scan years of logs. Second, it uses aiofiles for async writes, because the agent loop writes after every turn and a blocking write on the hot path stalls concurrent calls in a multi-agent swarm. Third, it exposes grep(pattern, days_back=7) which compiles a regex once and scans the last N days line-by-line, never loading a full file into memory. At 100K lines per day, the scan stays under 100ms and uses constant memory.
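A synchronous sketch of that sharded grep, assuming the one-file-per-day naming described above (the repo's async implementation differs in detail):

```python
import json
import re
from datetime import date, timedelta
from pathlib import Path

def grep_transcripts(log_dir: Path, pattern: str, days_back: int = 7) -> list[dict]:
    """Scan the last N daily shards line-by-line with a regex compiled once.

    One line is in memory at a time, so memory use stays constant even when
    a shard grows to hundreds of megabytes. Sketch only; names are illustrative.
    """
    rx = re.compile(pattern)
    hits: list[dict] = []
    for offset in range(days_back):
        shard = log_dir / f"{date.today() - timedelta(days=offset)}.jsonl"
        if not shard.exists():
            continue
        with shard.open(encoding="utf-8") as f:
            for line in f:  # streamed, never loaded whole
                if rx.search(line):
                    hits.append(json.loads(line))
    return hits
```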
4.2 Semantic: A Key-Value Index#
The semantic layer is a set of named Markdown files with YAML frontmatter, plus an index that maps topic names to one-line descriptions. The index is what lives in the system prompt; the topic files are what you load on a hit.
The index is a flat text file, one pointer per line:
from pathlib import Path
_MAX_LINES = 200
_MAX_LINE_CHARS = 150
class MemoryIndex:
def __init__(self, memory_dir: Path):
self.path = memory_dir / "MEMORY.md"
def upsert(self, topic_file: str, description: str) -> None:
line = f"- [{topic_file}]({topic_file}) — {description}"
if len(line) > _MAX_LINE_CHARS:
line = line[:_MAX_LINE_CHARS - 1] + "…"
lines = self.path.read_text(encoding="utf-8").splitlines() if self.path.exists() else []
lines = [l for l in lines if not l.startswith(f"- [{topic_file}]")]
lines.append(line)
if len(lines) > _MAX_LINES:
lines = lines[-_MAX_LINES:]
self.path.write_text("\n".join(lines) + "\n", encoding="utf-8")
def lookup(self, topic: str) -> str | None:
if not self.path.exists():
return None
for line in self.path.read_text(encoding="utf-8").splitlines():
if f"- [{topic}]" in line:
return line
return None
Three rules enforced in upsert: truncate long lines (every entry is a pointer, not a description), dedupe by topic name (upsert, not append), and cap at 200 lines (when full, drop oldest). The cap is a cost contract. At 200 lines × 150 chars the index stays around 30KB, which is roughly 7,500 tokens per system prompt. Uncapped, a busy agent's index grows to thousands of lines within a month and every call pays for it. With ten concurrent workers each loading the index, an uncapped 1,000-line index costs about $1.13 per orchestration cycle on Sonnet; the capped version costs about $0.23. Same agent, same output, five times the budget, just because nobody drew a line.
Topic files live next to the index. A typical file looks like this:
---
name: Auth Module Refactor
description: Refactoring token expiry validation in auth module
type: project
---
## Overview
Refactoring the auth module to fix token validation.
## Key issues
- Token expiry not checked in 3 places
- No test coverage for edge cases
Read-write-delete on these files is ordinary text I/O. The only enforcement is the 25KB per-file cap, which matters because a topic loaded into context contributes directly to per-call cost. Full implementation at swarm/memory/index.py lines 12–112 for the index, and swarm/memory/topics.py for frontmatter parsing, type validation, and the per-file cap.
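Frontmatter this flat needs no YAML library. A sketch of the split, assuming `key: value` pairs between `---` fences as in the example above (the repo's swarm/memory/topics.py may differ):

```python
from pathlib import Path

def parse_topic(path: Path) -> tuple[dict[str, str], str]:
    """Split a topic file into its frontmatter dict and Markdown body.

    Assumes flat `key: value` frontmatter; nested YAML is out of scope.
    """
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}, text  # no frontmatter: whole file is body
    header, _, body = text[3:].partition("\n---")
    meta: dict[str, str] = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```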
4.3 Archival: Consolidation via autoDream#
Raw transcripts accumulate faster than topic files. Without consolidation, the index fills with low-value entries ("user said hi", "agent said ok"), topic files go stale, and the system prompt grows noisy.
The consolidation cycle reads recent transcripts, asks a model to extract durable facts, and writes them into new or updated topic files. We call this autoDream, after sleep cycles in neuroscience (Walker, 2017). The name is a mnemonic; what it does is memory consolidation. Going forward we just say "the dream" or "consolidation".
Consolidation is expensive, so we gate it aggressively. The triple gate: at least 24 hours since the last dream, at least 5 sessions logged, and no advisory lock held by another process.
from datetime import datetime, timedelta, timezone
from pathlib import Path

# _load_state reads dream_state.json from memory_dir (full helper in dream.py)
async def should_dream(memory_dir: Path, *, session_count: int) -> bool:
state = _load_state(memory_dir)
if max(state.get("session_count", 0), session_count) < 5:
return False
last = state.get("last_dream")
if last:
elapsed = datetime.now(tz=timezone.utc) - datetime.fromisoformat(last)
if elapsed < timedelta(hours=24):
return False
if (memory_dir / "dream.lock").exists():
return False
return True
The gates matter independently. Without the time gate, a chatty session can trigger two dreams an hour apart, each consolidating the same transcripts into slightly different topic files. Without the session gate, a single-turn session fires a dream on thin input and ships a topic file that says nothing. Without the lock gate, two workers running consolidation in parallel both read the same state, both write topic files, and whoever finishes last silently clobbers the other's output.
The lock file stores a PID and a timestamp. On a contending write, we check os.kill(pid, 0) to distinguish a live lock from a stale one left by a crashed process. Signal 0 is a no-op that just tests process existence: returns normally if the process is alive, raises ProcessLookupError if it is gone. That second case means the lock is stale and we can safely remove it. Without this check, a crashed worker freezes consolidation forever. Full gate and lock logic at swarm/memory/dream.py lines 80–183.
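A sketch of that liveness probe, assuming the lock file is JSON with a pid field (the repo's lock format also carries a timestamp, omitted here):

```python
import json
import os
from pathlib import Path

def lock_is_stale(lock_path: Path) -> bool:
    """True if the lock's recorded PID no longer maps to a live process."""
    pid = json.loads(lock_path.read_text(encoding="utf-8"))["pid"]
    try:
        os.kill(pid, 0)  # signal 0: existence check, no signal delivered
    except ProcessLookupError:
        return True      # process gone -> lock left by a crashed worker
    except PermissionError:
        return False     # process exists but is owned by another user
    return False
```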
The dream itself runs four phases.
flowchart LR
A[Phase 1: Orient<br/>read index, topics,<br/>recent transcripts] --> B[Phase 2: Gather<br/>LLM emits JSON plan:<br/>upsert, delete, summary]
B --> C[Phase 3: Consolidate<br/>write topic files,<br/>update index pointers]
C --> D[Phase 4: Index<br/>update dream_state.json<br/>reset session counter]
Phase 1 (Orient) reads the index, a summary of every existing topic file, and the tail of recent transcripts. This is the context we hand to the model. Phase 2 (Gather) asks a model to emit a JSON plan: which topics to upsert with what body, which to delete, and a one-line summary of the changes. The prompt is locked to JSON output so the rest of the pipeline does not need an LLM to parse the plan. Phase 3 (Consolidate) applies the plan to disk: each upsert becomes a topic file write plus an index update; each delete removes the file and the pointer. Phase 4 (Index) updates dream_state.json with the new timestamp, a summary, and a reset session counter. The next dream will not fire until we accumulate five new sessions and 24 hours have passed.
4.4 Run It#
With the three layers wired together in MemoryStore, the full pipeline takes one file: create a store, log five turns, write two topics by hand, trigger consolidation.
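A condensed, self-contained stand-in for that driver (the repo's MemoryStore wires the same steps through one class; the function and paths here are illustrative, and consolidation is mocked as a fixed topic write):

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path

def run_demo(root: Path) -> dict:
    """Log turns (cold), write topics (warm), grep, then mock-consolidate."""
    log = root / "transcripts" / "demo.jsonl"
    topics = root / "topics"
    log.parent.mkdir(parents=True, exist_ok=True)
    topics.mkdir(parents=True, exist_ok=True)

    # Step 1: five turns, append-only JSONL
    turns = [
        ("alice", "user", "Let's start working on the auth module refactor."),
        ("agent", "assistant", "Sure. I'll start by reading the existing auth code."),
        ("alice", "user", "Focus on the token validation logic first."),
        ("agent", "assistant", "Found 3 places where token expiry isn't checked."),
        ("alice", "user", "Fix those and write tests for each case."),
    ]
    with log.open("a", encoding="utf-8") as f:
        for agent_id, role, content in turns:
            f.write(json.dumps({
                "ts": datetime.now(tz=timezone.utc).isoformat(),
                "agent_id": agent_id, "role": role, "content": content,
            }) + "\n")

    # Steps 3 and 5: hand-written topics, then a grep over the cold layer
    (topics / "auth-module-refactor.md").write_text("# Auth refactor\nToken expiry gaps.\n")
    (topics / "project-context.md").write_text("# Project\nFastAPI service.\n")
    rx = re.compile("auth")
    hits = [json.loads(l) for l in log.read_text(encoding="utf-8").splitlines()
            if rx.search(l)]

    # Step 6: mock consolidation -- one new topic summarizing the session
    (topics / "session-summary.md").write_text("# Session Summary\n5 turns consolidated.\n")
    return {"turns": len(turns), "grep_hits": len(hits),
            "topics": sorted(p.stem for p in topics.glob("*.md"))}
```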
The output (condensed):
Step 1: Logging 5 conversation turns
logged: alice/user: Let's start working on the auth module refactor.
logged: agent/assistant: Sure. I'll start by reading the existing auth code
logged: alice/user: Focus on the token validation logic first.
logged: agent/assistant: Found 3 places where token expiry isn't checked.
logged: alice/user: Fix those and write tests for each case.
Step 2: Index contents (hot layer)
[2026-04-21T...] alice/user: Let's start working on the auth module refactor.
[2026-04-21T...] agent/assistant: Sure. I'll start by reading the existing auth code.
[2026-04-21T...] alice/user: Focus on the token validation logic first.
[2026-04-21T...] agent/assistant: Found 3 places where token expiry isn't checked.
[2026-04-21T...] alice/user: Fix those and write tests for each case.
Step 3: Writing 2 topics (warm layer)
Topics written: ['auth-module-refactor', 'project-context']
Step 4: Topic search
search('token') → ['auth-module-refactor']
search('FastAPI') → ['project-context']
Step 5: Transcript grep (cold layer)
grep('auth') → 2 match(es)
Step 6: autoDream consolidation
Dream complete (0ms)
Summary: [MOCK] Consolidated 5 transcript entries into Session Summary topic
Topics upserted: 1
- Session Summary
A few things to notice. The index after Step 1 contains five raw turn pointers. After Step 6, one new pointer appears at the bottom: [topic] Session Summary. Over many sessions, the dream compresses: it turns 50 raw-turn pointers into 2 or 3 topic pointers. The raw transcripts stay in the JSONL file forever; what changes is what the hot layer points to.
The mock summary is a fixed string because SWARM_MOCK=true short-circuits the LLM call. In real mode, the dream costs roughly $0.01–$0.05 per run and the summary reflects actual consolidated content. At the default gates (5 sessions, 24h) a personal agent dreams about once a day, which works out to roughly a dollar a month.
Inspect dream_state.json after the run:
{
"last_dream": "2026-04-21T07:00:00Z",
"session_count": 0,
"last_summary": "[MOCK] Consolidated 5 transcript entries..."
}
If last_dream is more than a week old and session count is high, consolidation is stalled. Usual suspects: a stale lock from a crashed process, a failing LLM call that keeps retrying and giving up, a gate that was never met because session count resets on each dream. Monitor this file; it is the single best indicator of memory health.
One more thing to watch for as the system runs: topic staleness. Topic files do not self-update. A topic about "auth module refactor" written in January may be misleading by March if the refactor was completed and rolled back. The updated timestamp in the frontmatter is the signal: flag topics older than N days as "possibly stale" in the system prompt, which tells the agent to validate before relying on them. The dream eventually overwrites stale topics with fresh ones, but only if recent transcripts touch the same subject matter. Subjects that fade out of conversation fade out of the memory system too, which is usually what you want.
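A sketch of that staleness flag, assuming the frontmatter's last-updated field is an ISO timestamp (the 30-day threshold is illustrative; tune it per deployment):

```python
from datetime import datetime, timedelta, timezone

def staleness_flag(updated_iso: str, max_age_days: int = 30) -> str:
    """Annotation for an index line when a topic may be outdated."""
    updated = datetime.fromisoformat(updated_iso)
    if updated.tzinfo is None:
        updated = updated.replace(tzinfo=timezone.utc)  # assume UTC if naive
    if datetime.now(tz=timezone.utc) - updated > timedelta(days=max_age_days):
        return "(possibly stale -- verify before relying on this)"
    return ""
```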
5. Why Two Agents#
Memory extends the agent through time. A second agent extends it through perspective.
Here is the blindness problem, concretely. Ask a model to generate code, then ask the same model in the same context to critique the code. The critique almost always returns APPROVE. The model has already decided the code is fine (that is why it emitted it). A review prompt tacked onto the end inherits the generator's frame: same assumptions, same misreadings, same confidence. When people say "the model is confidently wrong", this is often where the confidence comes from. It has already committed.
Now change one thing. Start a fresh conversation with a different system prompt: "You are a code reviewer. Your job is to find problems. Return SCORE and RECOMMENDATION: APPROVE or REVISE." Give it only the code, not the generation history. This critic catches issues the generator rubber-stamped, because it has been primed into a different frame and has not been primed to defend anything. The weights are identical. The scratch is empty. The role prompt does the work.
Formally, if f_gen(x) produces code from task x, f_eval(y) scores code y, and f_refine(y, c) revises y given critique c, the loop's claim is that for y = f_gen(x):

f_eval(f_refine(y, f_eval(y))) ≥ f_eval(y)
The refined output scores at least as well as the original, and usually better, because the refiner has both the original code and a diagnosis of its flaws. Madaan et al. (2023, Self-Refine, arXiv:2303.17651) tested this on seven tasks (code optimization, math reasoning, dialogue response, and more) across GPT-3.5, ChatGPT, and GPT-4. The finding held everywhere, and critically: the same model taking both roles worked nearly as well as a separate feedback model. Role prompting does the work people assume requires a separate, larger model.
This is not magic. It is constraint. The critic system prompt restricts the model to a specific frame ("find problems") and a machine-readable output ("APPROVE | REVISE"). The frame does the thinking; the format does the control flow. Swap "find problems" for "be encouraging" and the critic goes soft and approves everything, even garbage; swap it for "be rigorous and list every violation of PEP 8" and the critic goes strict and rejects everything. The two-agent pattern is a way of picking a frame and holding it, which is hard for a single agent that has to generate and judge in the same breath.
6. The Generator and Critic Loop#
The loop has three roles played by the same model with different system prompts: generator, critic, refiner. The critic's recommendation controls the exit.
sequenceDiagram
participant Task
participant Generator
participant Critic
participant Refiner
Task->>Generator: task description
Generator-->>Task: code v1
loop max_rounds
Task->>Critic: code + task
Critic-->>Task: SCORE, ISSUES, RECOMMENDATION
alt APPROVE
Task->>Task: break
else REVISE
Task->>Refiner: code + critique
Refiner-->>Task: code v2
end
end
6.1 Role-Specialized Prompts#
GENERATOR_SYSTEM = """You are a skilled Python developer.
Given a task, write clean, correct Python code.
Output ONLY the code, no explanation."""
CRITIC_SYSTEM = """You are a code reviewer.
Given code, identify issues. Be specific.
Format your critique as:
SCORE: X/10
ISSUES:
- [issue 1]
- [issue 2]
RECOMMENDATION: APPROVE | REVISE"""
The structured critic format is load-bearing. SCORE: X/10 can be parsed without a second LLM call. RECOMMENDATION: APPROVE | REVISE gives the loop a machine-readable exit signal. Free-form critique forces you to either burn another LLM call on parsing or write brittle regex against prose, both worse than constraining the critic's output.
6.2 The Loop#
async def run_refinement_loop(
task: str, *, model: str, max_rounds: int = 3,
) -> RefinementState:
code = await run_generator(task, system=GENERATOR_SYSTEM, model=model)
rounds: list[dict] = []
for round_num in range(1, max_rounds + 1):
critique, recommendation = await run_critic(code, task, model=model)
rounds.append({"code": code, "critique": critique, "rec": recommendation})
if recommendation == "APPROVE":
return RefinementState(rounds=rounds, converged=True, final_code=code)
if round_num < max_rounds:
code = await run_refiner(code, critique, task, model=model)
return RefinementState(rounds=rounds, converged=False, final_code=code)
Three things to notice. First, APPROVE exits early: do not burn tokens if the critic is satisfied. max_rounds is the safety net, not the expected path. Second, the critic receives the task along with the code, because evaluating correctness requires knowing what the code is supposed to do. A critic who sees only the code can grade style but not whether the code does the right thing. Third, every round is recorded in rounds: the trace is the audit trail, the score progression, the critique per revision. When something goes wrong in production, this trace is what tells you whether the critic was too lenient, the refiner got stuck, or the generator produced nonsense on round one.
The score parsing stays deliberately simple. The critic writes SCORE: 7/10 on its own line; the loop reads the line by prefix and extracts the integer. Free-form critique would force a second LLM call just to extract the score, doubling cost per round for no benefit. When an agent's output feeds a parser, a router, or a loop controller, require machine-readable structure. The critic's SCORE line is for the loop; the ISSUES list is for the refiner; the RECOMMENDATION is for the exit.
Cost math for max_rounds=3: one generator call, up to three critic calls, and up to two refiner calls (the loop skips the final refinement, since its output would never be re-checked), so six calls maximum. On Haiku at typical sizes that is well under a cent per task. On Sonnet it is around a cent. The loop is cheap enough to run in CI on every pull request, which also means it is cheap enough to run a thousand times a day in production without worrying about the bill.
6.3 A Concrete Run#
Task: "Write a Python function that finds all palindromes in a list of strings."
Round 1: SCORE 6/10 — missing input validation, no type hints
RECOMMENDATION: REVISE
Round 2: SCORE 8/10 — validation added, types added; edge cases?
RECOMMENDATION: REVISE
Round 3: SCORE 9/10 — edge cases handled
RECOMMENDATION: APPROVE
Final code:
def find_palindromes(strings: list[str]) -> list[str]:
if not isinstance(strings, list):
raise TypeError(f"Expected list, got {type(strings).__name__}")
return [s for s in strings if isinstance(s, str) and s == s[::-1]]
The convergence curve is uneven. Round 1 catches the obvious stuff (validation, types). Round 2 catches the next tier (edges). Round 3 approves. Madaan et al. reports the same diminishing-returns pattern across tasks: the first critique is worth the most. Beyond round 3 you are usually paying for noise rather than quality.
One architectural note. The loop above is the single-episode version of Reflexion (Shinn et al., 2023, arXiv:2303.11366), which extends the pattern with memory that persists across attempts: verbal feedback from past failures is injected into future generations. Reflexion reports 91% on HumanEval versus 80% baseline, an eleven-point gain from verbal feedback alone and no gradient updates. If you stitch this chapter's memory system to this chapter's critic loop, you have the core of Reflexion: the critic's critique becomes a transcript entry, a future dream pulls it into a "common pitfalls" topic file, and the next generator run reads that file before it starts writing. That composition is left as an exercise but the pieces are all here.
7. When Two Agents Doesn't Help#
The loop has four failure modes, two of them expensive.
Noise-not-signal. If the critic's bar is set low (a gentle system prompt, something like "be constructive"), it will approve borderline code on round 1 and ship the same bug the single-agent version would have shipped. How adversarial the critic is comes entirely from its system prompt. A gentle critic is a proofreader, not a critic. One team shipped a legal document drafting agent with a critic primed to be "constructive and encouraging", and the critic approved documents containing the same boilerplate error in 30% of cases, because it had been primed to find the positive. Investigation revealed the system prompt had optimized for user comfort, not document quality. The fix was two-part: rewrite the critic prompt to list disqualifying criteria explicitly ("If the document is missing a jurisdiction clause, you MUST output REVISE"), and add the eval harness from Chapter 05 as an independent quality gate. An adversarial stance has to be instructed; it is not a default.
APPROVE oscillation. The critic flips between APPROVE and REVISE on consecutive runs of the same code. Usually the cause is a vague rubric combined with a high-temperature model: the critic is guessing, and the guess flips. Lower the temperature on critic calls, or pin the critic to a deterministic scoring rubric where every criterion is either satisfied or not. If you cannot get stable verdicts on identical input, the signal is not measuring anything, and the loop is random noise dressed as quality control.
Infinite refinement. The task is unsolvable for the current model, the critic keeps finding real problems, the refiner keeps introducing new ones in the process of fixing the old. max_rounds catches this; set it conservatively (3 for simple tasks, 5 for complex). If more than 20% of production runs hit the cap without converging, your critic is too demanding, your refiner is too weak, or the specification is ambiguous. Fix the spec or the pairing, not the cap.
Cost at scale. A single generator call is one API hit. A generator-critic loop with max_rounds=3 is up to seven. On a low-traffic agent this is trivial. On a million-call-per-day system it is a 7x multiplier, which turns a $30/day line item into $210/day. Before defaulting every path to a critic loop, measure: does the loop actually improve the metric you care about, or does it just improve the critic's opinion of the output? Those are different numbers, and the loop optimizes the second one by construction.
That last point is the one that matters for Chapter 05. The loop emits "accept" signals, the critic says APPROVE. That signal does not measure quality against a reference. It is a proxy, and the critic is a proxy that can be flattered. Without a harness outside the loop, any claim about "pair A is better than pair B" is eyeballing, not evidence. If pair A's critic is easy and pair B's critic is strict, pair A converges faster, looks cheaper, and ships worse code. You will deploy the wrong pair and not know until a user complains.
Chapter 05 builds that harness: a held-out task set with reference answers, an LLM-as-judge with position-bias mitigation, and OpenTelemetry traces so every decision inside every loop is inspectable after the fact. Once you have it, "pair A versus pair B" becomes a number with a confidence interval. Until you do, the critic's APPROVE is an opinion, and the loop's job is only to produce opinions faster than a human reviewer.
8. What Goes Wrong & Onward#
The APPROVE signal is soft. The memory system stays clean only if the dream runs; if it stalls, the index fills with conversational noise and the system prompt gets louder for the same information. Both layers of this chapter, memory and collaboration, are self-reports. The store claims it remembered. The critic claims the code is good. Neither can grade itself.
Chapter 05 builds the eval harness that measures both properly: reference-scored quality, confidence intervals wide enough to distinguish signal from run-to-run noise, and traces that show where the loop actually spent its rounds and what the memory system actually retrieved at each step. After that chapter, "better" has a number with a confidence interval, and you stop deploying on vibes and start deploying on evidence.