Chapter 06: Orchestrator-Workers: Fork-Join, Sectioning, Voting & MoA#
In this chapter:

- Why parallelism is not just a speed optimization; it's an architectural principle that changes how failures are isolated
- The five workflow patterns from Anthropic's "Building Effective Agents": prompt chaining, routing, parallelization (sectioning + voting), orchestrator-workers, evaluator-optimizer
- Build a fork-join swarm: one orchestrator plans, N workers execute in parallel, one verifier checks
- Sectioning (Pattern 3a): split a large task into independent subtasks, gather, merge
- Voting (Pattern 3b): run the same task N times independently, ship only if ≥K agree
- Mixture of agents (the MoA pattern) and how it generalizes diversity into reliability
- When to use workflows (predictable steps) vs. agent loops (unpredictable steps)
1. Motivation#
You have memory, tools, eval, and a critic loop. The bottleneck: everything is still sequential.
For a 10-unit project, sequential takes 10× longer than parallel. If each unit takes 8 seconds on Haiku, sequential is 80 seconds. With fork-join, it's 8 seconds plus overhead. That is not a micro-optimization. That is a different product.
The parallelism win is only part of it. The deeper point is architectural: independent units executing simultaneously gives you isolation. Each worker sees only its task. Workers can't corrupt each other's state. Failures are contained. Retries are surgical.
The pattern: one orchestrator plans, N workers execute simultaneously, one verifier checks. This is the architecture that powers every serious agentic system in production: Claude Code, GitHub Copilot Workspace, every coding assistant that "opens a PR" behind the scenes. Anthropic's "Building Effective Agents" guide (2024) maps the full taxonomy it belongs to.
2. First Principles#
The history of distributed task decomposition#
The pattern has roots in two independent intellectual traditions going back five decades.
MapReduce (Dean & Ghemawat, 2004). Jeff Dean and Sanjay Ghemawat at Google published "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004). The insight: many large-scale computations can be expressed as two phases: a map phase (apply a function to each input in parallel, producing intermediate key-value pairs) and a reduce phase (aggregate the intermediate results by key). The user writes only the map and reduce functions; the framework handles distribution, failure, retry, and result gathering. The fork-join swarm in this chapter is MapReduce at the LLM layer: orchestrator = planner (decides the decomposition), workers = mappers (execute independently in parallel), verifier = reducer (aggregates and checks).
The Actor Model connection. The orchestrator-workers pattern is an actor system in miniature: isolated workers communicate via messages, failures are contained. Hewitt (1973) formalized this; Erlang made it production-grade.
Workflow patterns: the full taxonomy#
Anthropic's "Building Effective Agents" guide identifies five core patterns:
Pattern 1, Prompt Chaining: A fixed sequence of LLM calls where the output of each step is input to the next. Use when the task has a natural linear structure: research, outline, draft, edit.
Pattern 2, Routing: A classifier LLM decides which downstream handler to invoke. A customer service agent classifies the request (billing? technical? cancellation?) and routes to a specialized sub-agent. Use when inputs are heterogeneous.
Pattern 3a, Parallelization: Sectioning: Split a large task into N independent subtasks and execute simultaneously. Gather and merge. Use when the task decomposes into non-overlapping chunks: N files, N endpoints, N sections.
Pattern 3b, Parallelization: Voting: Run the same task N times independently and take the majority answer. Use when correctness is critical and diversity of sampling helps: code review, factual verification, security audits.
Pattern 4, Orchestrator-Workers: A planner LLM dynamically decides what tasks need doing and spawns workers accordingly. Unlike Sectioning (fixed decomposition), the orchestrator adapts: if a worker fails, it can re-plan. Use when the decomposition itself requires intelligence.
Pattern 5, Evaluator-Optimizer: A generator produces output; an evaluator scores it; feedback is fed back to the generator. Use when quality can be objectively assessed and iterative refinement is appropriate.
When to use workflows vs. agent loops#
A key design decision precedes all five: should this be a workflow (fixed structure, predictable steps) or an agent loop (LLM decides next action)?
flowchart TD
START["New task arrives"]
Q1{"Can you predict<br/>the steps needed?"}
Q2{"Decomposable into<br/>independent subtasks?"}
Q3{"Need multiple<br/>perspectives on<br/>the same task?"}
Q4{"Subtasks<br/>heterogeneous?"}
WF_CHAIN["Prompt Chaining"]
WF_SEC["Sectioning"]
WF_VOTE["Voting"]
WF_ROUTE["Routing"]
AGENT["Agent Loop"]
OW["Orchestrator-Workers"]
START --> Q1
Q1 -->|"Yes"| Q2
Q1 -->|"No"| AGENT
AGENT --> OW
Q2 -->|"Yes, fixed split"| Q3
Q2 -->|"No, linear sequence"| WF_CHAIN
Q3 -->|"Yes (voting)"| WF_VOTE
Q3 -->|"No (just faster)"| Q4
Q4 -->|"Yes"| WF_ROUTE
Q4 -->|"No"| WF_SEC
style WF_CHAIN fill:#ffcc66,color:#000
style WF_SEC fill:#ffcc66,color:#000
style WF_VOTE fill:#ffcc66,color:#000
style WF_ROUTE fill:#ffcc66,color:#000
style AGENT fill:#ff9966,color:#000
style OW fill:#ff6644,color:#fff
Can you predict the steps? Yes → workflow (deterministic, cheaper, easier to debug). No → agent loop (flexible but more expensive).
Workflows compose: a sectioning workflow can contain routing sub-workflows; an orchestrator-workers loop can spawn sectioning sub-workflows. The taxonomy is a palette of primitives, not a strict hierarchy.
asyncio.gather is not just faster sequential#
asyncio.gather runs coroutines concurrently: each is suspended at every await and resumed as its I/O completes. For LLM calls (almost entirely network I/O), three 8-second API calls take ~8 seconds total, not 24.
The more important contract: workers are independent. They share no mutable state. Each receives inputs at call time and returns output at completion. This independence is what makes the pattern composable and safe.
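A minimal timing sketch (using asyncio.sleep to stand in for network latency) makes both properties concrete: wall time is the max of the workers, not the sum, and each coroutine touches only its own locals:

```python
import asyncio
import time

async def fake_llm_call(unit_id: str, seconds: float) -> str:
    # Stand-in for a network-bound LLM call: the coroutine suspends at
    # await, so the other "workers" run while this one waits.
    await asyncio.sleep(seconds)
    return f"{unit_id}: done"

async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_llm_call("unit_1", 0.2),
        fake_llm_call("unit_2", 0.2),
        fake_llm_call("unit_3", 0.2),
    )
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(main())
# elapsed is ~0.2s (the max), not 0.6s (the sum)
```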
The KV cache fork insight#
When all workers share the same system prompt and static context, Anthropic's prompt cache serves that prefix to every worker at 10% of normal input cost. The first worker pays cache-write; every subsequent worker pays only cache-read.
Anthropic keys the cache on the byte-exact prefix. If every worker has a byte-identical system prompt + static doc, they all hit the same cache entry. Only the dynamic part (the unit description) is new tokens.
Cost is O(1 + N + 1) LLM calls: one orchestrator, N workers, one verifier. Beyond N≈8, returns diminish: subtasks start depending on each other, the verifier's context grows linearly with N, and coordination overhead eats the parallelism gain.
Canonical Source: MapReduce
Dean & Ghemawat (2004, OSDI): "MapReduce: Simplified Data Processing on Large Clusters." The orchestrator-workers architecture is MapReduce applied to LLM tasks.
research_and_plan() is the equivalent of the map function specification. Workers are the mappers. The verifier is a reduce step aggregating results into a verdict. The same fault-tolerance properties apply: if a mapper (worker) fails, it can be retried independently. Google ran thousands of MapReduce jobs daily on clusters of thousands of machines; the same principles scale down to 3-8 LLM workers.
3. Build It#
Open code/swarm.py. Four phases, each a distinct async function.
from dataclasses import dataclass

@dataclass
class WorkUnit:
    id: str
    title: str
    description: str
    status: str = "pending"  # pending | running | done | failed
    output: str = ""
    error: str = ""
    latency_ms: float = 0.0
    model: str = ""
Phase 1: research_and_plan#
The orchestrator receives the goal and a JSON schema, and returns a structured plan. JSON parsing is the first failure point: LLMs don't always output valid JSON for complex nested structures. The fix:

- Strip markdown code fences (models often wrap JSON in ```json)
- Try json.loads()
- On failure: return a single WorkUnit covering the whole goal
try:
    data = json.loads(clean)
    units = [WorkUnit(id=u["id"], title=u["title"], ...) for u in data["units"]]
    return units
except (json.JSONDecodeError, KeyError, TypeError):
    return [WorkUnit(id="unit_1", title="Complete goal", description=goal, ...)]
The fallback is critical: if the planner fails, the swarm still attempts execution with one worker. Partial execution beats a crash.
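A self-contained sketch of the parse-and-fallback logic (the field names mirror the WorkUnit dataclass above, trimmed to what the parser needs; the real swarm.py may differ in detail):

```python
import json
from dataclasses import dataclass

@dataclass
class WorkUnit:  # trimmed to the fields the parser touches
    id: str
    title: str
    description: str

def parse_plan(raw: str, goal: str) -> list[WorkUnit]:
    clean = raw.strip()
    # Strip markdown code fences the model may have wrapped around the JSON.
    if clean.startswith("```"):
        clean = clean.split("\n", 1)[1] if "\n" in clean else ""
        clean = clean.rsplit("```", 1)[0]
    try:
        data = json.loads(clean)
        return [WorkUnit(id=u["id"], title=u["title"],
                         description=u["description"])
                for u in data["units"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fallback: one worker attempts the whole goal. Partial > crash.
        return [WorkUnit(id="unit_1", title="Complete goal", description=goal)]
```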
Phase 2: run_worker#
Each worker formats WORKER_SYSTEM_TEMPLATE with its unit fields, then calls the LLM. The static_doc parameter is the prompt caching hook: if provided, it's prepended so all workers share it. Exceptions are stored in unit.error instead of re-raised. A single worker failure should not abort the swarm. [full: swarm/agents/worker.py:20-100]
Phase 3: run_fork_join#
async def run_fork_join(units, *, ...) -> list[WorkUnit]:
    coros = [run_worker(unit, ...) for unit in units]
    raw_results = await asyncio.gather(*coros, return_exceptions=True)
return_exceptions=True is the key. Without it, any coroutine that raises cancels all remaining coroutines. With it, exceptions are returned as values and converted to failed WorkUnits. That is the difference between "one failure kills the swarm" and "failures are isolated."
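To see the contract in isolation, here is a toy run (hypothetical worker, no LLM involved) where one coroutine raises and its siblings still complete:

```python
import asyncio

async def worker(i: int) -> str:
    # Hypothetical worker: worker 2 fails deliberately.
    if i == 2:
        raise RuntimeError("worker 2 blew up")
    await asyncio.sleep(0.01)
    return f"worker {i}: ok"

async def main() -> list[str]:
    raw = await asyncio.gather(*[worker(i) for i in range(4)],
                               return_exceptions=True)
    # Exceptions come back as ordinary values; convert them to "failed"
    # markers instead of letting one raise cancel the siblings.
    return [f"worker {i}: FAILED ({r})" if isinstance(r, Exception) else r
            for i, r in enumerate(raw)]

results = asyncio.run(main())
# ['worker 0: ok', 'worker 1: ok', 'worker 2: FAILED (...)', 'worker 3: ok']
```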
Phase 4: verify_results#
The verifier receives a summary of all unit outputs and their acceptance criteria, returns a verdict (PASS | FAIL | PARTIAL) and report. VERIFIER_SYSTEM explicitly instructs it to be adversarial, not rubber-stamp the workers. Parsing is simple text scanning for VERDICT: and REPORT: lines, not JSON: verifier output is read by humans.
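A hedged sketch of that text-scanning parser (the VERDICT:/REPORT: line convention is from the chapter; the exact implementation here is an assumption). Defaulting to FAIL when no verdict line is found errs on the safe side:

```python
def parse_verdict(text: str) -> tuple[str, str]:
    # Scan line by line; everything after a REPORT: line is report body.
    verdict, report_lines, in_report = "FAIL", [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("VERDICT:"):
            verdict = stripped.split(":", 1)[1].strip().upper()
            in_report = False
        elif stripped.upper().startswith("REPORT:"):
            report_lines.append(stripped.split(":", 1)[1].strip())
            in_report = True
        elif in_report:
            report_lines.append(stripped)
    return verdict, "\n".join(report_lines).strip()
```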
4. Parallelization: Sectioning (Pattern 3a)#
You have a large task, you know upfront how to split it, you run simultaneously, gather, merge.
flowchart TD
TASK["Large Task"]
SPLIT["Static Decomposition"]
W1["Worker 1"]
W2["Worker 2"]
W3["Worker 3"]
WN["Worker N"]
GATHER["Gather"]
MERGE["Merge"]
OUT["Final Output"]
TASK --> SPLIT
SPLIT --> W1 & W2 & W3 & WN
W1 & W2 & W3 & WN --> GATHER --> MERGE --> OUT
Key properties of Sectioning: - Split is deterministic: you know the sections before any LLM call - Sections are non-overlapping: each worker handles exactly its piece - Merge is mechanical: concatenation, schema merge, file assembly
Sectioning is cheaper and simpler than full orchestrator-workers because the planner call is eliminated: you define the decomposition in code. Use it when the task structure is regular: N files, N test cases, N document sections.
async def run_sectioning(sections, *, worker_fn, merge_fn) -> dict:
    results = await asyncio.gather(
        *[worker_fn(s) for s in sections],
        return_exceptions=True,
    )
    clean = []
    for i, r in enumerate(results):
        if isinstance(r, Exception):
            clean.append({"id": sections[i]["id"], "error": str(r)})
        else:
            clean.append(r)
    return merge_fn(clean)
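Putting it to work with a mocked worker (run_sectioning is repeated so the block runs standalone; the failure on s1 is deliberate, to show how a broken section surfaces in the merge rather than killing the run):

```python
import asyncio

async def run_sectioning(sections, *, worker_fn, merge_fn) -> dict:
    results = await asyncio.gather(
        *[worker_fn(s) for s in sections], return_exceptions=True)
    clean = []
    for i, r in enumerate(results):
        clean.append({"id": sections[i]["id"], "error": str(r)}
                     if isinstance(r, Exception) else r)
    return merge_fn(clean)

async def summarize_section(section: dict) -> dict:
    # Hypothetical worker: a real one would call an LLM here.
    if section["id"] == "s1":
        raise TimeoutError("simulated API timeout")
    await asyncio.sleep(0)
    return {"id": section["id"], "summary": section["text"].upper()}

def merge_summaries(results: list[dict]) -> dict:
    # Mechanical merge: keep good sections, list the failed ids.
    return {
        "summaries": {r["id"]: r["summary"] for r in results if "error" not in r},
        "failed": [r["id"] for r in results if "error" in r],
    }

sections = [{"id": f"s{i}", "text": f"section {i}"} for i in range(3)]
merged = asyncio.run(run_sectioning(sections,
                                    worker_fn=summarize_section,
                                    merge_fn=merge_summaries))
# merged["failed"] == ["s1"]; the other two sections still merged.
```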
Under the Hood: Static vs. Dynamic Decomposition
The difference between Sectioning and full Orchestrator-Workers is who decides the decomposition. In Sectioning, a human (the developer) writes code that splits the task. In Orchestrator-Workers, an LLM decides the split at runtime. Sectioning is cheaper, more predictable, and easier to test (unit-test worker_fn with fixed inputs). Orchestrator-Workers handles tasks where the right decomposition can't be known in advance. Choose Sectioning when the split is obvious; escalate when the split requires judgment.
5. Parallelization: Voting (Pattern 3b)#
Instead of splitting, you run the same task N times independently and take the majority answer.
flowchart TD
TASK["'Is this code safe to merge?'"]
SPAWN["Spawn N independent reviewers"]
V1["Reviewer 1: SAFE"]
V2["Reviewer 2: SAFE"]
V3["Reviewer 3: UNSAFE (spotted XSS)"]
VN["Reviewer N..."]
AGG["Majority vote (K of N)"]
DEC{"≥ K agree SAFE?"}
SHIP["Ship"]
BLOCK["Block, flag for human"]
TASK --> SPAWN
SPAWN --> V1 & V2 & V3 & VN
V1 & V2 & V3 & VN --> AGG --> DEC
DEC -->|"Yes"| SHIP
DEC -->|"No"| BLOCK
style SHIP fill:#aaffaa,color:#000
style BLOCK fill:#ffaaaa,color:#000
The intuition: a single LLM reviewer misses issues. Different runs make different errors; LLMs are stochastic. Run N reviewers independently; any can raise a flag. Require K of N to agree on "safe" before shipping. The more critical the decision, the higher K relative to N (K/N is your confidence threshold).
Canonical use case: code review before a PR merge. A single pass might miss a subtle SQL injection or race condition. Five independent passes have a much lower joint probability of all missing the same issue. For high-stakes merges, this small multiplier on cost buys meaningful safety.
from collections import Counter

async def run_voting(task, *, n_voters=5, k_threshold=4,
                     verdicts, voter_fn):
    raw = await asyncio.gather(
        *[voter_fn(task) for _ in range(n_voters)],
        return_exceptions=True,
    )
    all_verdicts = []
    for r in raw:
        if isinstance(r, Exception):
            all_verdicts.append("ERROR")
        else:
            matched = next((v for v in verdicts if v.upper() in r.upper()),
                           "UNKNOWN")
            all_verdicts.append(matched)
    counts = Counter(all_verdicts)
    winner, winner_count = counts.most_common(1)[0]
    consensus_reached = winner_count >= k_threshold
    return winner, all_verdicts, consensus_reached
[full: swarm/core/voting.py or equivalent]
Example usage:
async def safe_to_merge(pr_diff: str) -> bool:
    winner, verdicts, consensus = await run_voting(
        task=f"Review this code diff and output SAFE or UNSAFE:\n\n{pr_diff}",
        n_voters=5, k_threshold=4,
        verdicts=["SAFE", "UNSAFE"],
        voter_fn=review_code,
    )
    return winner == "SAFE" and consensus
Why voting works: each reviewer samples a different trajectory through the response space. The probability of N independent reviewers all missing the same bug is exponentially lower than a single reviewer missing it, assuming missed bugs are at least partially uncorrelated.
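The arithmetic, under the (strong) independence assumption: if one reviewer misses a given bug with probability p, all N miss it with probability p^N:

```python
def p_all_miss(p_single_miss: float, n: int) -> float:
    # Probability that every one of n INDEPENDENT reviewers misses the bug.
    return p_single_miss ** n

def p_caught(p_single_miss: float, n: int) -> float:
    # Probability at least one reviewer flags it.
    return 1 - p_all_miss(p_single_miss, n)

# A reviewer that misses 30% of bugs, run five times independently:
assert round(p_all_miss(0.30, 5), 5) == 0.00243   # 0.24% joint miss rate
assert round(p_caught(0.30, 5), 5) == 0.99757
```

Correlated errors (same prompt bug, same blind spot) push the real joint miss rate back toward the single-reviewer rate, which is why the independence anti-pattern below matters.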
Temperature variation: same task, same model, temperatures [0.0, 0.3, 0.7, 1.0, 1.2] produces a spread of perspectives. The deterministic run (temp=0.0) anchors the answer. High-temperature runs (temp ≥ 1.0) explore edge cases the deterministic run might systematically miss. Combining deterministic and exploratory runs in a single voting pool improves coverage.
Cost trade-off: N=5 voters means 5× per-task cost. For high-stakes decisions (production merges, financial transactions), the reliability gain is worth it. K controls the trade-off: K=N (unanimity) is maximally conservative; K=1 is maximally permissive. The right K depends on the cost of false positives versus false negatives.
Anti-pattern: Voting Without Independence
Voting only improves reliability if voters are genuinely independent. If all five voters share the same bug in the prompt, all five make the same error. The vote is 5-0, but it's 5-0 wrong. Independence requires different random seeds (or temperatures), ideally different models. If you're running five copies at temperature 0.0 (deterministic), you're not voting; you're running the same computation five times and expecting a different result.
6. Mixture of Agents#
The mixture of agents pattern (the MoA pattern) runs the same task on N models in parallel, then has an aggregator synthesize a final answer:
# Phase 1: fan out
responses = await asyncio.gather(*[_worker(m) for m in models])

# Phase 2: aggregate
combined = "\n\n".join(
    f"[Response {i+1}]:\n{r}" for i, r in enumerate(responses))
final = await call_llm_fn(system=AGG_SYSTEM, prompt=combined + task, ...)
Wang et al. (2024, arXiv:2406.04692) formalized this. Their finding: aggregating N weaker models can outperform a single stronger model on many benchmarks. The diversity-then-synthesis pattern generalizes beyond LLMs: ensemble methods in ML (bagging, boosting), Condorcet's jury theorem in political science (independent majority vote outperforms any individual as N grows).
Canonical Source: MoA Paper
Wang et al. (2024) use layered aggregation: multiple "proposer" agents generate candidates independently; an "aggregator" synthesizes the final answer by identifying consensus and resolving conflicts. Proposers don't see each other's outputs; the aggregator sees all of them. Three GPT-4-Turbo proposers + GPT-4-Turbo aggregator outperforms single GPT-4-Turbo on AlpacaEval 2.0, FLASK, and MT-Bench. The aggregator has more signal than any individual proposer: it sees the distribution of answers and can use divergence as an uncertainty signal.
Different models make different errors. An aggregator that sees multiple independent answers can identify consensus versus uncertainty and synthesize rather than average. The same principle applies with temperature variation: temperatures 0.0/0.5/1.0 on the same model produce different samples; their consensus is more reliable than any single pass.
MoA vs. Voting: in Voting the aggregator is mechanical (count, return plurality); in MoA the aggregator is an LLM that synthesizes a final answer that may not match any individual response. Use MoA when the right answer requires synthesis; Voting when it's discrete and verifiable.
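A mocked end-to-end MoA sketch (fake models and a fake aggregator show the two-phase data flow; real use swaps in API calls to different providers or temperatures):

```python
import asyncio

async def fake_model(name: str, task: str) -> str:
    # Stand-in proposer: a real one would be an LLM API call.
    await asyncio.sleep(0)
    return f"[{name}] answer to: {task}"

async def fake_aggregator(proposals: list[str], task: str) -> str:
    # A real aggregator is an LLM call with every proposal in context;
    # here we just report what it would see, to show the data flow.
    return f"synthesis of {len(proposals)} proposals for: {task}"

async def run_moa(task: str, model_names: list[str]) -> str:
    # Phase 1: fan out to all proposers concurrently.
    proposals = await asyncio.gather(
        *[fake_model(m, task) for m in model_names])
    # Phase 2: a single aggregator sees all proposals.
    return await fake_aggregator(list(proposals), task)

answer = asyncio.run(run_moa("best sort for nearly-sorted data?",
                             ["haiku-a", "haiku-b", "sonnet"]))
```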
7. Production Detail: Git Worktree Isolation (Optional)#
Skip on first read. Relevant when parallel workers modify files on disk.
git worktree add creates a new working directory linked to the same git object store, no clone, no copy. Each worker gets its own directory, git index, and branch. Instant setup (shared object store), full git index, automatic cleanup (git worktree remove --force), branch isolation (each worker → its own PR). [full: swarm/core/worktree.py]
sequenceDiagram
participant O as Orchestrator
participant W1 as Worker 1
participant W2 as Worker 2
participant G as git object store
O->>G: git worktree add worktree-1/ branch-1
O->>G: git worktree add worktree-2/ branch-2
par Worker 1
W1->>W1: write files
W1->>G: git commit → push branch-1
and Worker 2
W2->>W2: write files
W2->>G: git commit → push branch-2
end
O->>G: git worktree remove worktree-1/ worktree-2/
Stale Worktrees. If a worker crashes mid-operation, the worktree may remain on disk. Stale worktrees don't cause data loss; they just clutter. Prune with git worktree prune && git worktree list. Run periodically in CI cleanup.
In mock mode, git commands are skipped and a temp directory is passed. work_fn receives the path either way.
Prompt caching across workers uses static_doc:
if static_doc:
    header = "# SHARED CONTEXT (cached across all workers)"
    system = f"{header}\n{static_doc}\n\n---\n\n{system}"
In the answer-key, explicit cache_control blocks mark the dynamic boundary:
content_blocks = [
    {"type": "text", "text": static_doc,
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": prompt},  # not cached
]
8. Run It#
SWARM START
Goal: Build a Python CLI that counts words...
[Phase 1] Orchestrator planning... 3 units planned (0ms)
[Phase 2] Spawning 3 workers in parallel... 3/3 done, 0 failed (87ms)
[Phase 3] Adversarial verification... Verdict: PASS (0ms)
SWARM COMPLETE, 89ms total | Cost: $0.000000
The parallelism win in mock mode is small (workers run near-instantly). With real API calls, total swarm time is max(worker_latencies) + orchestrator_latency + verifier_latency, not the sum. Three 8-second workers take ~8 seconds.
Demo 2 (MoA) shows:
DEMO 2: Mixture of Agents (MoA)
Task: Best sorting algorithm for nearly-sorted data?
Workers: 2× Haiku | Aggregator: Haiku
MoA answer: Timsort is optimal (O(n) best case on nearly-sorted).
9. Observe It#
Cost breakdown#
Per swarm run at N=3 on Haiku:
| Call | Input × rate | Output × rate | Subtotal |
|---|---|---|---|
| Orchestrator | 2000 × $0.80/M | 500 × $4.00/M | $0.0036 |
| 3 workers | 1500 × $0.80/M ×3 | 500 × $4.00/M ×3 | $0.0096 |
| Verifier | 3000 × $0.80/M | 500 × $4.00/M | $0.0044 |
| Total | | | ~$0.018 |
Those subtotals already include output tokens (~500 each at $4.00/M). Prompt caching discounts only input: with a large static_doc, per-worker input cost drops 90%. A 10,000-token codebase summary is paid once as a cache write, then each worker reads it at 10% of the normal input rate.
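A back-of-envelope calculator makes the savings concrete (rates as quoted in this chapter; the 1.25× cache-write multiplier reflects Anthropic's published pricing at time of writing — verify against current docs):

```python
HAIKU_INPUT = 0.80 / 1_000_000  # $ per input token, as used above

def worker_input_cost(static_tokens: int, dynamic_tokens: int,
                      n_workers: int, cached: bool) -> float:
    """Total input-token cost across all workers, with/without caching."""
    if not cached:
        return n_workers * (static_tokens + dynamic_tokens) * HAIKU_INPUT
    write = static_tokens * HAIKU_INPUT * 1.25           # first worker writes
    reads = (n_workers - 1) * static_tokens * HAIKU_INPUT * 0.10
    dynamic = n_workers * dynamic_tokens * HAIKU_INPUT   # never cached
    return write + reads + dynamic

# 10,000-token static doc, 500-token unit descriptions, 8 workers:
assert round(worker_input_cost(10_000, 500, 8, cached=False), 4) == 0.0672
assert round(worker_input_cost(10_000, 500, 8, cached=True), 4) == 0.0188
```

At 8 workers the cached run pays roughly a quarter of the uncached input bill, and the gap widens as static_doc grows.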
Voting cost: N=5 voters at $0.018 each = $0.09 per voting round. For code review before a production merge, this is trivially cheap. For a development loop that runs 100 reviews per day, that's $9/day, acceptable for a team but worth tracking.
The parallelism win#
Sequential time (N=3, each worker takes T seconds):
T_sequential = T_orchestrator + (T_w1 + T_w2 + T_w3) + T_verifier
             ≈ T_orchestrator + 3·T_worker + T_verifier
Fork-join:
T_parallel = T_orchestrator + max(T_w1, T_w2, T_w3) + T_verifier
           ≈ T_orchestrator + T_worker + T_verifier
For N=3 with T_worker=8s: sequential=24s vs parallel=8s, 3× speedup. At N=8: sequential=64s vs parallel=8s, 8× speedup. Cost is identical (same total token count). The parallelism is free.
10. Break It#
Run this swarm with 10 units and all-Opus models:
result = await run_swarm(
goal="Build a complete REST API with 10 endpoints...",
model="claude-opus-4-6",
max_workers=10,
)
Do the math:
Orchestrator: 4,000 in × $15/M + 2,000 out × $75/M = $0.21
10 workers: 10 × (2,000 × $15/M + 3,000 × $75/M) = $2.55
Verifier: 8,000 × $15/M + 1,000 × $75/M = $0.20
Total: ~$2.96 per swarm run
Running 100 swarms/day = ~$300/day. Chapter 07 solves this with routing: Haiku for workers (10× cheaper), Sonnet for orchestrator and verifier, Opus only for units that fail the quality bar. Understand your cost formula before you scale.
Now break Voting deliberately by removing stochasticity. Run N=5 but temperature=0.0 (deterministic), and you get five identical votes. If that path misses an issue, all five voters miss it. (See modules/08_orchestrator_workers/what_goes_wrong.py for the full demo.) True independence requires stochastic sampling or different models.
War Story: The Verifier That Always Passed
A team's verifier passed 98.6% of security patches because its prompt said "be constructive and supportive", the model interpreted this as "don't block." A verifier that always passes is worse than none: it provides false confidence. Instruct verifiers to find problems, not validate, and calibrate against known-bad inputs before trusting them.
11. Exercises#
Exercise 01: Retry Failed (exercises/01_retry_failed.py)#
Implement run_swarm_with_retry(goal, **kwargs) -> SwarmResult. After run_fork_join, collect failed units. Re-run them once in a second gather. Transient errors (rate limits, timeouts, 5xx) cause 2-5% failures in real workloads; a single retry recovers most.
Exercise 02: Status Tracker (exercises/02_status_tracker.py)#
Implement class StatusTracker with mark_running, mark_done, mark_failed, render(). Wrap run_worker to call the tracker before and after. The answer-key's StatusTracker (swarm/batch/tracker.py) uses Rich.
Exercise 03: MoA with Temperature Variation (exercises/03_moa_temperatures.py)#
Run the same model at each temperature in parallel (default [0.0, 0.5, 1.0]). The aggregator synthesizes. Temperature=0.0 gives the same answer every time; temperature=1.0 varies.
Exercise 04: Voting-Based Code Review (exercises/04_voting_review.py)#
Implement a production-grade code review gate using Voting. Test against a diff with a subtle bug (missing input validation, off-by-one, unsanitized SQL). Tune n_voters and k_threshold until the bug is reliably caught without false positives on clean diffs.
12. Summary#
Chapter Takeaways
- Orchestrator-workers traces to MapReduce (Dean & Ghemawat, 2004) and the actor model (Hewitt, 1973). Decompose into independent units, execute in parallel, gather and aggregate. The LLM layer adds dynamic decomposition and natural language as the coordination protocol.
- Five workflow patterns cover the design space: prompt chaining, routing, parallelization (sectioning + voting), orchestrator-workers, evaluator-optimizer. Can you predict the steps? Yes → workflow. No → agent loop.
- Sectioning (Pattern 3a) is the simpler parallelization: fixed deterministic split, no planner LLM. Use when decomposition is obvious.
- Voting (Pattern 3b) runs the same task N times and requires K of N to agree. Reliability is exponential in N when voter errors are uncorrelated. Voting without temperature variation or model diversity is not voting, it's deterministic repetition.
- MoA (Wang et al., 2024) adds an LLM aggregator. The aggregator synthesizes, not just counts. Weaker proposers + strong aggregator can outperform a single strong model.
- asyncio.gather(..., return_exceptions=True) is the critical primitive. Without it, one failing coroutine cancels all others. The difference between "one failure kills the swarm" and "failures are contained."
- Stale git worktrees are a silent hazard after crashes. git worktree prune && git worktree list is your cleanup. Add to CI teardown.
- Cost formula is O(1 + N + 1) LLM calls. At N=8 Haiku: ~$0.04/swarm. At N=8 Opus: ~$3/swarm. Know your formula before scaling.
Checkpoint: Layer 6 complete. Run the swarm on a task that decomposes into parallel subtasks. Watch the orchestrator plan, workers execute concurrently, verifier check. N=8 Haiku should cost under $0.05. If it completes and the math checks out, you have a working multi-agent system.