
Multi-Agent Debate#

A decision-support agent that poses an engineering question to a panel of role-biased debaters, has them critique each other, and returns a verdict with an audit-ready evidence table.

Problem#

Important engineering decisions ("should we migrate from Postgres to DynamoDB?", "rewrite in Go?") rarely have a single right answer. They are trade-offs dressed as questions. A single LLM call tends to pick a side and write confident prose either way; you get an answer that sounds good and discover the holes in a post-mortem six months later.

This system forces the model to consider the decision from multiple perspectives simultaneously. You get:

  • A panel of 3 debaters, each with a distinct persona (database expert, cost analyst, risk manager, etc.) whose reasoning is biased by a system prompt.
  • Parallel openings, parallel critiques — asyncio.gather everywhere.
  • A consensus detector that tells you honestly whether the panel converged.
  • A moderator synthesis that is NOT allowed to invent claims the debaters did not make.
  • An evidence table so you can audit which role made which claim.
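The parallel fan-out can be sketched as below. This is a minimal illustration, not the project's actual API: the `Debater` class, `opening()` method, and role names are hypothetical stand-ins, and the LLM call is stubbed out so the sketch runs offline.

```python
import asyncio

class Debater:
    """Hypothetical debater: one model, one role-biased system prompt."""
    def __init__(self, role: str):
        self.role = role

    async def opening(self, question: str) -> str:
        # The real version would await an LLM call here; stubbed for the sketch.
        await asyncio.sleep(0)
        return f"[{self.role}] opening position on: {question}"

async def gather_openings(question: str, roles: list[str]) -> list[str]:
    debaters = [Debater(r) for r in roles]
    # One gather, N concurrent calls -- no debater waits on another.
    return await asyncio.gather(*(d.opening(question) for d in debaters))

positions = asyncio.run(gather_openings(
    "Should we migrate from Postgres to DynamoDB?",
    ["database_expert", "cost_analyst", "risk_manager"],
))
```

The same gather pattern applies to the critique round: wall-clock time per round is one model call, regardless of panel size.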

Architecture#

  question
     |
     v
  roles_for_question(question, n=3)        (deterministic persona selector)
     |
     v
  Debater × N ---- opening (parallel gather)
     |                        |
     v                        v
  ConsensusDetector  <-- latest positions
     |
     [not converged?]
     v
  Debater × N ---- critique (parallel gather)
     |                        |
     v                        v
  ConsensusDetector  <-- latest positions
     |
     [converged or max rounds]
     v
  Moderator.synthesize(question, transcript)
     |
     v
  EvidenceTable.build_table(all_positions)
     |
     v
  DebateResult

Why three layers?#

  • Debater. One model, one role. Keeps persona drift minimal.
  • ConsensusDetector. Python, not LLM. The moderator cannot overrule it, which means the moderator cannot claim false consensus.
  • Moderator. Writes the synthesis paragraph. Has access to the entire transcript but no ability to insert new claims.
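The point of the middle layer is that convergence is decided by plain Python over stance labels, so no prompt injection or persuasive synthesis can flip the verdict. A minimal sketch, assuming a majority-vote rule (the real detector's logic may differ):

```python
from collections import Counter

def detect_consensus(stances: list[str]) -> str:
    """Map a list of stance labels to one of the three verdicts.

    Assumed rule: unanimity -> consensus, a strict majority -> split,
    no majority at all -> incoherent. This is an illustrative policy,
    not the project's exact implementation.
    """
    counts = Counter(stances)
    _, top_n = counts.most_common(1)[0]
    if top_n == len(stances):
        return "consensus"      # every debater holds the same stance
    if top_n > len(stances) / 2:
        return "split"          # a majority exists, but not unanimous
    return "incoherent"         # positions do not cluster at all
```

Because this function is deterministic, the moderator can only describe the verdict, never manufacture one.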

Cost estimate#

At default settings (Claude Haiku, 3 debaters, 2 rounds):

  • Openings: 3 calls × ~200 in / 90 out = 270 tokens out total.
  • Critiques: 3 calls × ~250 in / 70 out = 210 tokens out total.
  • Synthesis: 1 call × ~320 in / 85 out.
  • Haiku 4.5: ~$0.0015 per debate.
  • Sonnet 4.6: ~$0.015 per debate.
  • Opus 4.7: ~$0.15 per debate.

Scales linearly in panel size: a five-debater panel costs roughly 5/3 as much as the default three for the opening and critique rounds.
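The arithmetic behind those numbers can be written down as a back-of-envelope model. The per-call token figures come from the estimates above; the function itself (`debate_tokens`) and the assumption that `rounds=2` means one opening round plus one critique round are illustrative:

```python
def debate_tokens(panel: int = 3, rounds: int = 2) -> tuple[int, int]:
    """Estimated (input, output) tokens for one debate at default sizes."""
    openings_in, openings_out = panel * 200, panel * 90    # per-debater opening
    critiques_in, critiques_out = panel * 250, panel * 70  # per-debater critique
    synth_in, synth_out = 320, 85                          # single moderator call
    # One opening round, then (rounds - 1) critique rounds, then synthesis.
    total_in = openings_in + (rounds - 1) * critiques_in + synth_in
    total_out = openings_out + (rounds - 1) * critiques_out + synth_out
    return total_in, total_out

tin, tout = debate_tokens()
print(tin, tout)  # 1670 565
```

Multiply the totals by your model's per-token prices to reproduce the per-debate figures; the dollar amounts above are estimates, so substitute current rates.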

Failure modes#

Every multi-agent system fails in predictable ways. Know them.

  • Echo chamber. All personas lean the same way because the selector picked too-similar roles. Mitigation: roles_for_question always includes at least one skeptic (performance_skeptic, risk_manager) or pragmatist. If you override roles=, ensure a diverse mix.
  • Infinite disagreement. Debaters keep restating their openings. The consensus detector reports this as incoherent, and the orchestrator stops once the round budget (--rounds) is exhausted.
  • False consensus. Personas converge on a stance for mutually inconsistent reasons. Mitigation: the evidence table flags whether a claim carries an evidence marker (%, p95, $/month, etc.). If the consensus is all opinion, treat it with suspicion.
  • Hallucinated evidence. Debaters are instructed not to invent sources, but they still can. The evidence table's has_evidence_marker column lets a human skim for numbers that look made up.
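The evidence-marker check that powers the false-consensus and hallucinated-evidence mitigations can be as simple as a regex over each claim. The pattern set below is a hypothetical sketch keyed to the markers named above (%, p95, dollar figures); the project's actual `has_evidence_marker` column may use a different pattern:

```python
import re

EVIDENCE_MARKERS = re.compile(
    r"(\d+(\.\d+)?\s*%)"          # percentages: "35%"
    r"|(\bp9[059]\b)"             # latency percentiles: p90 / p95 / p99
    r"|(\$\s?\d[\d,]*(\.\d+)?)"   # dollar figures: "$1,200/month"
    r"|(\d+\s*(ms|GB|QPS)\b)"     # common engineering units
)

def has_evidence_marker(claim: str) -> bool:
    """True if the claim cites at least one quantitative marker."""
    return EVIDENCE_MARKERS.search(claim) is not None
```

A claim flagged False is pure opinion; a claim flagged True still needs a human to check whether the number is real, which is exactly what the audit table is for.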

Running#

SWARM_MOCK=true python -m projects.multi_agent_debate.agent \
    "Should we migrate from Postgres to DynamoDB?"

With a real API key the same command runs on Claude Haiku by default. Override with --model, --panel-size, or --rounds.

Tests#

SWARM_MOCK=true .venv/bin/pytest projects/multi_agent_debate/ -v

The test suite exercises:

  • Stance detection on representative language patterns.
  • Consensus detector verdicts for consensus / split / incoherent.
  • The full pipeline under mock mode (consensus in 2 rounds, split reported honestly, evidence table has one row per extracted claim, moderator does not fabricate claims).
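As an illustration of the first bullet, a stance-detection test might look like the following. Both `detect_stance` and its keyword heuristics are hypothetical stand-ins for the project's helpers, included only to show the shape of the test:

```python
def detect_stance(text: str) -> str:
    """Toy stance classifier over representative language patterns."""
    lowered = text.lower()
    if "should not" in lowered or "against" in lowered:
        return "oppose"
    if "should" in lowered or "in favor" in lowered:
        return "support"
    return "neutral"

def test_stance_detection():
    assert detect_stance("We should migrate.") == "support"
    assert detect_stance("I am against the rewrite.") == "oppose"
    assert detect_stance("More benchmarks are needed.") == "neutral"
```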