When Consensus Is Not Correctness: Diversity Collapse and Manufactured Overconfidence in Multi-Agent LLM Debate

TMLR Paper9844 Authors

18 Jun 2026 (modified: 20 Jun 2026) · Under review for TMLR · CC BY 4.0

Abstract: Multi-agent large language model (LLM) debate is widely believed to improve answers, and agreement is routinely read as evidence of trustworthiness. We show that debate transforms agreement from evidence into an outcome: agreement is endogenous to the interaction that produces it. A variance account, tied to measured diversity collapse, makes this precise. As agents read one another, the inter-agent correlation rises toward one. That same correlation controls both the panel's error and the disagreement an operator reads as confidence, driving them in opposite directions: the ensemble stops averaging out error exactly as it stops looking uncertain. Three consequences follow. Apparent confidence saturates independently of error. The central empirical signature is not the identity G = C - A = R - (1 - C), but the collapse-induced flattening of the confidence shortfall 1 - C: terminal confidence has 17x smaller variance than accuracy, and one shared shortfall predicts per-condition gaps out of sample. Consistent with this signature, the induced gap-residual regression is affine on the primary model, with slope 0.82 and R² = 0.96, and remains monotone with sub-unit slopes across model-family probes. Whether debate is benign is then a race between error correction and confidence inflation, governed by role design and task headroom. We introduce Calibrated Multi-Agent Debate, a certification-first framework with two conditional levers, Prevent and Detect, and a split-conformal certificate, Certify. With exchangeable labeled calibration, Certify controls set coverage under collapse, at the cost of larger sets or abstention on hard cases, while agreement-based stopping commits confident errors at 18-47% miscoverage. Matched self-critique and verdict-injection controls separate interaction-driven amplification from baseline model overconfidence.
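The opposing movement of panel error and visible disagreement can be illustrated with a textbook equicorrelated model. This is a minimal sketch under assumptions that are not taken from the paper: n agents whose errors share a common mean, a common variance sigma2, and a single pairwise correlation rho. The function names are hypothetical. Under those assumptions the variance of the panel mean is sigma2 * (1 + (n - 1) * rho) / n, and the expected sample variance across agents (the disagreement an operator would see) is sigma2 * (1 - rho).

```python
# Illustrative sketch only: an equicorrelated panel, assumed here for exposition
# and not necessarily the variance account used in the paper.

def ensemble_variance(sigma2: float, rho: float, n: int) -> float:
    """Variance of the mean of n agent errors with common variance sigma2
    and common pairwise correlation rho."""
    return sigma2 * (1.0 + (n - 1) * rho) / n


def expected_disagreement(sigma2: float, rho: float) -> float:
    """Expected unbiased sample variance across agents under the same model."""
    return sigma2 * (1.0 - rho)


if __name__ == "__main__":
    n_agents = 5      # hypothetical panel size
    sigma2 = 1.0      # hypothetical per-agent error variance
    for rho in (0.0, 0.5, 0.9, 1.0):
        err = ensemble_variance(sigma2, rho, n_agents)
        dis = expected_disagreement(sigma2, rho)
        print(f"rho={rho:.1f}  panel error variance={err:.3f}  disagreement={dis:.3f}")
```

As rho approaches one, the first quantity rises toward sigma2, so averaging no longer reduces error, while the second falls toward zero, so the panel looks unanimous. In this toy model, low disagreement therefore says nothing about how large the panel's error is.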

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Peng_Li2

Submission Number: 9844
