# Multi-Agent Debate
A decision-support agent that poses an engineering question to a panel of role-biased debaters, has them critique each other, and returns a verdict with an audit-ready evidence table.
## Problem
Important engineering decisions ("should we migrate from Postgres to DynamoDB?", "rewrite in Go?") rarely have a single right answer. They are trade-offs dressed as questions. A single LLM call tends to pick a side and write confident prose either way; you get an answer that sounds good and discover the holes in a post-mortem six months later.
This system forces the model to consider the decision from multiple perspectives simultaneously. You get:
- A panel of 3 debaters, each with a distinct persona (database expert, cost analyst, risk manager, etc.) whose reasoning is biased by a system prompt.
- Parallel openings and parallel critiques, with `asyncio.gather` everywhere.
- A consensus detector that tells you honestly whether the panel converged.
- A moderator synthesis that is NOT allowed to invent claims the debaters did not make.
- An evidence table so you can audit which role made which claim.
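The deterministic persona selector mentioned in the architecture below might look like the following sketch. The pool contents, the hashing scheme, and the skeptic guarantee are assumptions; only the name `roles_for_question` and the role names come from this README.

```python
import hashlib

# Hypothetical persona pool -- the real pool and its ordering are assumptions.
ROLE_POOL = [
    "database_expert",
    "cost_analyst",
    "risk_manager",
    "performance_skeptic",
    "pragmatist",
]

SKEPTIC_ROLES = {"performance_skeptic", "risk_manager"}

def roles_for_question(question: str, n: int = 3) -> list[str]:
    """Deterministically pick n personas for a question.

    Hashing the question keeps the panel stable across runs; the panel
    is patched to always include at least one skeptic.
    """
    digest = int(hashlib.sha256(question.encode()).hexdigest(), 16)
    # Rotate the pool by the hash so different questions get different panels.
    offset = digest % len(ROLE_POOL)
    rotated = ROLE_POOL[offset:] + ROLE_POOL[:offset]
    panel = rotated[:n]
    if not SKEPTIC_ROLES & set(panel):
        panel[-1] = "risk_manager"  # guarantee a dissenting voice
    return panel
```

Determinism matters here: the same question should produce the same panel, so debate runs are reproducible.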
## Architecture
    question
      |
      v
    roles_for_question(question, n=3)   (deterministic persona selector)
      |
      v
    Debater × N ---- opening (parallel gather)
      |        |
      v        v
    ConsensusDetector <-- latest positions
      |
      [not converged?]
      v
    Debater × N ---- critique (parallel gather)
      |        |
      v        v
    ConsensusDetector <-- latest positions
      |
      [converged or max rounds]
      v
    Moderator.synthesize(question, transcript)
      |
      v
    EvidenceTable.build_table(all_positions)
      |
      v
    DebateResult
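The control flow above can be sketched roughly as follows. The method names (`opening`, `critique`, `converged`, `synthesize`) and the `DebateResult` fields are assumptions made for illustration; only the overall loop shape comes from the diagram.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class DebateResult:
    verdict: str
    converged: bool
    transcript: list[str] = field(default_factory=list)

async def run_debate(question, debaters, detector, moderator, rounds=2):
    transcript = []
    # Round 0: every debater writes an opening in parallel.
    positions = await asyncio.gather(*(d.opening(question) for d in debaters))
    transcript.extend(positions)
    converged = detector.converged(positions)
    # Critique rounds until consensus or the round budget is exhausted.
    for _ in range(rounds - 1):
        if converged:
            break
        positions = await asyncio.gather(
            *(d.critique(question, positions) for d in debaters)
        )
        transcript.extend(positions)
        converged = detector.converged(positions)
    verdict = await moderator.synthesize(question, transcript)
    return DebateResult(verdict=verdict, converged=converged, transcript=transcript)
```

Note that the detector runs between rounds, so a panel that converges on openings never pays for critiques.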
## Why three layers?
- Debater. One model, one role. Keeps persona drift minimal.
- ConsensusDetector. Python, not LLM. The moderator cannot overrule it, which means the moderator cannot claim false consensus.
- Moderator. Writes the synthesis paragraph. Has access to the entire transcript but no ability to insert new claims.
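A pure-Python detector along these lines would be enough to keep the moderator honest. The class name matches the README; the stance labels and the majority rule are assumptions, and the stance-extraction step itself is not shown.

```python
from collections import Counter

class ConsensusDetector:
    """Returns 'consensus', 'split', or 'incoherent' -- no LLM involved.

    Stances are plain labels ('for' / 'against' / 'unclear'); extracting
    them from debater text is a separate, assumed step.
    """

    def verdict(self, stances: list[str]) -> str:
        counts = Counter(stances)
        top_stance, top_count = counts.most_common(1)[0]
        # Mostly-unclear positions mean the debaters are talking past each other.
        if top_stance == "unclear" or counts.get("unclear", 0) > len(stances) // 2:
            return "incoherent"
        if top_count == len(stances):
            return "consensus"
        return "split"
```

Because this is deterministic code, the moderator can cite the verdict but never alter it.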
## Cost estimate
At default settings (Claude Haiku, 3 debaters, 2 rounds):
- Openings: 3 calls × ~200 in / 90 out ≈ 600 in / 270 out tokens.
- Critiques: 3 calls × ~250 in / 70 out ≈ 750 in / 210 out tokens.
- Synthesis: 1 call × ~320 in / 85 out.
- Haiku 4.5: ~$0.0015 per debate.
- Sonnet 4.6: ~$0.015 per debate.
- Opus 4.7: ~$0.15 per debate.
Cost scales linearly with panel size: a five-debater panel makes the opening and critique rounds about 5/3 (~1.7×) the three-debater cost; only the single synthesis call stays fixed.
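The back-of-envelope arithmetic behind these bullets can be captured in a small helper. The per-million-token prices below are placeholder assumptions for illustration, not the rates behind the figures above; swap in current pricing before relying on the output.

```python
# Illustrative $-per-million-token (input, output) prices -- assumptions only.
PRICES = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00)}

def debate_cost(model: str, panel_size: int = 3) -> float:
    """Estimate one debate's cost from the per-call token figures above."""
    in_price, out_price = PRICES[model]
    calls = [
        (panel_size, 200, 90),   # openings: one call per debater
        (panel_size, 250, 70),   # critiques: one call per debater
        (1, 320, 85),            # synthesis: a single moderator call
    ]
    tokens_in = sum(n * i for n, i, _ in calls)
    tokens_out = sum(n * o for n, _, o in calls)
    return (tokens_in * in_price + tokens_out * out_price) / 1_000_000
```

Running `debate_cost(model, panel_size=5)` shows the linear growth directly: the synthesis term is constant while both panel rounds scale with `panel_size`.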
## Failure modes
Every multi-agent system fails in predictable ways. Know them.
- Echo chamber. All personas lean the same way because the selector picked too-similar roles. Mitigation: `roles_for_question` always includes at least one skeptic (`performance_skeptic`, `risk_manager`) or pragmatist. If you override `roles=`, ensure a diverse mix.
- Infinite disagreement. Debaters keep restating their openings. The consensus detector catches this as `incoherent` and the orchestrator stops after `rounds` is exhausted.
- False consensus. Personas converge on a stance for mutually inconsistent reasons. Mitigation: the evidence table flags whether a claim carries an evidence marker (`%`, `p95`, `$/month`, etc.). If the consensus is all opinion, treat it with suspicion.
- Hallucinated evidence. Debaters are instructed not to invent sources, but they still can. The evidence table's `has_evidence_marker` column lets a human skim for numbers that look made up.
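The evidence-marker check is just pattern matching. This sketch covers the markers named above (`%`, `p95`, `$/month`); the exact pattern set in the real `has_evidence_marker` column is an assumption.

```python
import re

# Heuristic markers suggesting a claim carries quantitative evidence:
# percentages, latency percentiles (p90/p95/p99), and dollar amounts.
EVIDENCE_MARKER = re.compile(
    r"\d+(\.\d+)?\s*%"          # 30%, 12.5 %
    r"|\bp9[059]\b"             # p90, p95, p99
    r"|\$\s?\d[\d,]*(\s*/\s*month)?"  # $400, $1,200/month
)

def has_evidence_marker(claim: str) -> bool:
    return bool(EVIDENCE_MARKER.search(claim))
```

A marker is necessary but not sufficient: a confident-looking `p95` figure can still be invented, which is why the column feeds a human skim rather than an automatic pass.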
## Running
    SWARM_MOCK=true python -m projects.multi_agent_debate.agent \
        "Should we migrate from Postgres to DynamoDB?"
With a real API key the same command runs on Claude Haiku by default.
Override with `--model`, `--panel-size`, or `--rounds`.
## Tests
The test suite exercises:
- Stance detection on representative language patterns.
- Consensus detector verdicts for consensus / split / incoherent.
- The full pipeline under mock mode (consensus in 2 rounds, split reported honestly, evidence table has one row per extracted claim, moderator does not fabricate claims).
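The stance-detection tests in the first bullet could target a helper like this. The function name and the phrase patterns are assumptions chosen to illustrate "representative language patterns", not the project's actual rules.

```python
import re

# Minimal stance detector over a few representative phrases (assumed patterns).
FOR_PATTERNS = re.compile(r"\b(we should migrate|strongly recommend|clear win)\b", re.I)
AGAINST_PATTERNS = re.compile(r"\b(we should not|too risky|recommend against)\b", re.I)

def detect_stance(text: str) -> str:
    """Classify debater text as 'for', 'against', or 'unclear'."""
    if AGAINST_PATTERNS.search(text):
        return "against"
    if FOR_PATTERNS.search(text):
        return "for"
    return "unclear"
```

Keeping this as plain regexes makes the stance tests fast and deterministic, which is what lets the mock-mode pipeline tests assert exact verdicts.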