
Multi-Agent Debate#

A decision-support agent that poses an engineering question to a panel of role-biased debaters, has them critique each other, and returns a verdict with an audit-ready evidence table.

Problem#

Important engineering decisions ("should we migrate from Postgres to DynamoDB?", "rewrite in Go?") rarely have a single right answer. They are trade-offs dressed as questions. A single LLM call tends to pick a side and write confident prose either way; you get an answer that sounds good and discover the holes in a post-mortem six months later.

This system forces the model to consider the decision from multiple perspectives simultaneously. You get:

  • A panel of 3 debaters, each with a distinct persona (database expert, cost analyst, risk manager, etc.) whose reasoning is biased by a system prompt.
  • Parallel openings, parallel critiques — asyncio.gather everywhere.
  • A consensus detector that tells you honestly whether the panel converged.
  • A moderator synthesis that is NOT allowed to invent claims the debaters did not make.
  • An evidence table so you can audit which role made which claim.
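The parallel fan-out can be sketched as below. This is a minimal illustration, not the project's actual API: the `Debater` class, `opening()` method, and role names are hypothetical stand-ins, and the LLM call is stubbed out so the sketch runs offline.

```python
import asyncio

class Debater:
    """Hypothetical debater: one model, one role-biased system prompt."""
    def __init__(self, role: str):
        self.role = role

    async def opening(self, question: str) -> str:
        # The real version would await an LLM call here; stubbed for the sketch.
        await asyncio.sleep(0)
        return f"[{self.role}] opening position on: {question}"

async def gather_openings(question: str, roles: list[str]) -> list[str]:
    debaters = [Debater(r) for r in roles]
    # One gather, N concurrent calls -- no debater waits on another.
    return await asyncio.gather(*(d.opening(question) for d in debaters))

positions = asyncio.run(gather_openings(
    "Should we migrate from Postgres to DynamoDB?",
    ["database_expert", "cost_analyst", "risk_manager"],
))
```

The same gather pattern applies to the critique round: wall-clock time per round is one model call, regardless of panel size.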

Architecture#

  question
     |
     v
  roles_for_question(question, n=3)        (deterministic persona selector)
     |
     v
  Debater × N ---- opening (parallel gather)
     |                        |
     v                        v
  ConsensusDetector  <-- latest positions
     |
     [not converged?]
     v
  Debater × N ---- critique (parallel gather)
     |                        |
     v                        v
  ConsensusDetector  <-- latest positions
     |
     [converged or max rounds]
     v
  Moderator.synthesize(question, transcript)
     |
     v
  EvidenceTable.build_table(all_positions)
     |
     v
  DebateResult

Why three layers?#

  • Debater. One model, one role. Keeps persona drift minimal.
  • ConsensusDetector. Python, not LLM. The moderator cannot overrule it, which means the moderator cannot claim false consensus.
  • Moderator. Writes the synthesis paragraph. Has access to the entire transcript but no ability to insert new claims.
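The point of the middle layer is that convergence is decided by plain Python over stance labels, so no prompt injection or persuasive synthesis can flip the verdict. A minimal sketch, assuming a majority-vote rule (the real detector's logic may differ):

```python
from collections import Counter

def detect_consensus(stances: list[str]) -> str:
    """Map a list of stance labels to one of the three verdicts.

    Assumed rule: unanimity -> consensus, a strict majority -> split,
    no majority at all -> incoherent. This is an illustrative policy,
    not the project's exact implementation.
    """
    counts = Counter(stances)
    _, top_n = counts.most_common(1)[0]
    if top_n == len(stances):
        return "consensus"      # every debater holds the same stance
    if top_n > len(stances) / 2:
        return "split"          # a majority exists, but not unanimous
    return "incoherent"         # positions do not cluster at all
```

Because this function is deterministic, the moderator can only describe the verdict, never manufacture one.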

Cost estimate#

At default settings (Claude Haiku, 3 debaters, 2 rounds):

  • Openings: 3 calls × ~200 in / 90 out = 270 tokens out total.
  • Critiques: 3 calls × ~250 in / 70 out = 210 tokens out total.
  • Synthesis: 1 call × ~320 in / 85 out.
  • Haiku 4.5: ~$0.0015 per debate.
  • Sonnet 4.6: ~$0.015 per debate.
  • Opus 4.7: ~$0.15 per debate.

Scales linearly in panel size: a five-debater panel costs roughly 5/3 as much as the default three for the opening and critique rounds.
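The arithmetic behind those numbers can be written down as a back-of-envelope model. The per-call token figures come from the estimates above; the function itself (`debate_tokens`) and the assumption that `rounds=2` means one opening round plus one critique round are illustrative:

```python
def debate_tokens(panel: int = 3, rounds: int = 2) -> tuple[int, int]:
    """Estimated (input, output) tokens for one debate at default sizes."""
    openings_in, openings_out = panel * 200, panel * 90    # per-debater opening
    critiques_in, critiques_out = panel * 250, panel * 70  # per-debater critique
    synth_in, synth_out = 320, 85                          # single moderator call
    # One opening round, then (rounds - 1) critique rounds, then synthesis.
    total_in = openings_in + (rounds - 1) * critiques_in + synth_in
    total_out = openings_out + (rounds - 1) * critiques_out + synth_out
    return total_in, total_out

tin, tout = debate_tokens()
print(tin, tout)  # 1670 565
```

Multiply the totals by your model's per-token prices to reproduce the per-debate figures; the dollar amounts above are estimates, so substitute current rates.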

Failure modes#

Every multi-agent system fails in predictable ways. Know them.

  • Echo chamber. All personas lean the same way because the selector picked too-similar roles. Mitigation: roles_for_question always includes at least one skeptic (performance_skeptic, risk_manager) or pragmatist. If you override roles=, ensure a diverse mix.
  • Infinite disagreement. Debaters keep restating their openings. The consensus detector reports this as incoherent, and the orchestrator stops once the round budget (--rounds) is exhausted.
  • False consensus. Personas converge on a stance for mutually inconsistent reasons. Mitigation: the evidence table flags whether a claim carries an evidence marker (%, p95, $/month, etc.). If the consensus is all opinion, treat it with suspicion.
  • Hallucinated evidence. Debaters are instructed not to invent sources, but they still can. The evidence table's has_evidence_marker column lets a human skim for numbers that look made up.
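The evidence-marker check that powers the false-consensus and hallucinated-evidence mitigations can be as simple as a regex over each claim. The pattern set below is a hypothetical sketch keyed to the markers named above (%, p95, dollar figures); the project's actual `has_evidence_marker` column may use a different pattern:

```python
import re

EVIDENCE_MARKERS = re.compile(
    r"(\d+(\.\d+)?\s*%)"          # percentages: "35%"
    r"|(\bp9[059]\b)"             # latency percentiles: p90 / p95 / p99
    r"|(\$\s?\d[\d,]*(\.\d+)?)"   # dollar figures: "$1,200/month"
    r"|(\d+\s*(ms|GB|QPS)\b)"     # common engineering units
)

def has_evidence_marker(claim: str) -> bool:
    """True if the claim cites at least one quantitative marker."""
    return EVIDENCE_MARKERS.search(claim) is not None
```

A claim flagged False is pure opinion; a claim flagged True still needs a human to check whether the number is real, which is exactly what the audit table is for.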

Running#

SWARM_MOCK=true python -m projects.multi_agent_debate.agent \
    "Should we migrate from Postgres to DynamoDB?"

With a real API key the same command runs on Claude Haiku by default. Override with --model, --panel-size, or --rounds.

Tests#

SWARM_MOCK=true .venv/bin/pytest projects/multi_agent_debate/ -v

The test suite exercises:

  • Stance detection on representative language patterns.
  • Consensus detector verdicts for consensus / split / incoherent.
  • The full pipeline under mock mode (consensus in 2 rounds, split reported honestly, evidence table has one row per extracted claim, moderator does not fabricate claims).
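As an illustration of the first bullet, a stance-detection test might look like the following. Both `detect_stance` and its keyword heuristics are hypothetical stand-ins for the project's helpers, included only to show the shape of the test:

```python
def detect_stance(text: str) -> str:
    """Toy stance classifier over representative language patterns."""
    lowered = text.lower()
    if "should not" in lowered or "against" in lowered:
        return "oppose"
    if "should" in lowered or "in favor" in lowered:
        return "support"
    return "neutral"

def test_stance_detection():
    assert detect_stance("We should migrate.") == "support"
    assert detect_stance("I am against the rewrite.") == "oppose"
    assert detect_stance("More benchmarks are needed.") == "neutral"
```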