ConFreeze: Selective Multi-Model Debate through Consensus Freezing

ACL ARR 2026 January Submission10701 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: multi-model debate, robustness, consensus-based gating, LLM
Abstract: Multi-model debate enhances large language model reasoning but suffers from prohibitive computational costs and instability risks, where excessive deliberation can overturn a correct initial consensus. Yet existing research has focused primarily on performance gains, leaving the efficiency and stability implications of iterative debate underexplored. To address these limitations, we formulate multi-model debate as a decision control problem and propose \textit{ConFreeze}, a selective execution mechanism that uses initial vote patterns as a gating signal. When models unanimously agree in the initial round, we freeze the consensus to avoid computational waste and instability risk. When models disagree, we trigger a subsequent round of collaborative refinement in which models critique and revise their predictions. This allocates the debate budget to instances where reasoning conflicts signal improvement potential. To better reflect robustness and comprehensively capture debate dynamics, we report not only end-task quality but also stability measures (flip rate, improve/worsen rate) together with token cost. Experiments on ANLI, AdvGLUE, and TruthfulQA demonstrate that \textit{ConFreeze} achieves a 29.5\%-43.1\% token reduction while maintaining accuracy. Our findings reveal that debate benefits are concentrated almost exclusively in disputed instances, validating initial consensus as a reliable signal for efficient inference control.
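The gating mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `models` (callables returning an answer string) and `debate_round` (a refinement step that lets models critique and revise each other's votes) are hypothetical placeholders, and the single-round default and majority-vote fallback are assumptions.

```python
from collections import Counter

def confreeze(models, query, debate_round, max_rounds=1):
    """Consensus-gated selective debate (illustrative sketch).

    models: list of callables, each mapping a query to an answer string.
    debate_round: hypothetical helper that takes (models, query, votes)
        and returns revised votes after one round of critique/refinement.
    Returns (answer, number_of_debate_rounds_used).
    """
    # Initial independent votes form the gating signal.
    votes = [m(query) for m in models]
    if len(set(votes)) == 1:
        # Unanimous initial consensus: freeze it, spend no debate budget.
        return votes[0], 0
    # Disagreement: trigger collaborative refinement.
    for r in range(1, max_rounds + 1):
        votes = debate_round(models, query, votes)
        if len(set(votes)) == 1:
            return votes[0], r
    # No unanimity within budget: fall back to a majority vote (assumption).
    return Counter(votes).most_common(1)[0][0], max_rounds
```

Token savings come from the first branch: for unanimous instances the debate rounds are skipped entirely, and only disputed instances pay the refinement cost.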
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 10701