Self-Improvement of Language Models by Post-Training on Multi-Agent Debate

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 | MALGAI | CC BY 4.0
Keywords: self-improvement, multi-agent debate, language models, reasoning, self-consistency, reinforcement learning, self-training
TL;DR: We use multi-agent debate as a training signal for self-improvement, teaching language models through RL to better leverage the debate setting, while improving their reasoning accuracy and consistency.
Abstract: Self-improvement, where models improve beyond their current performance without external supervision, remains a challenge. The core difficulty is sourcing a training signal stronger than what the model itself can currently produce. Majority voting has been shown to provide such a signal by aggregating over multiple samples, helping mitigate some of the inconsistencies in LM reasoning. In this work, we show that multi-agent debate—where models collaborate and exchange reasoning over multiple rounds—provides an even richer signal than single-round majority voting. We introduce Multi-Agent Consensus Alignment (MACA), which uses reinforcement learning (RL) to post-train models to effectively utilize multi-agent debate. We find that preference learning over full reasoning traces, learning to differentiate between majority and minority reasoning, is more effective than binary consensus rewards or SFT-based approaches for leveraging these debate signals. This produces three key improvements: models are (1) better at utilizing the multi-agent debate setting (+26.87% on MATH), (2) individually more accurate (+21.51% on MathQA), and (3) more self-consistent (+27.6% on GSM8K). We also see strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA).
Submission Number: 90
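
The preference signal described in the abstract, where reasoning that reaches the majority answer is preferred over reasoning that reaches a minority answer, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_answer` and `build_preference_pairs`, the `(reasoning, answer)` trace format, and the choice to skip tied votes are all assumptions made for illustration. The abstract does not specify how pairs are constructed, how debate rounds are handled, or which RL objective consumes the pairs.

```python
from collections import Counter
from itertools import product


def majority_answer(answers):
    """Return (answer, vote_share) for the most common answer.

    Returns (None, 0.0) when there are no answers or the top two
    answers are tied, since no consensus exists in that case
    (skipping ties is an assumption of this sketch).
    """
    counts = Counter(answers).most_common()
    if not counts:
        return None, 0.0
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, 0.0
    answer, votes = counts[0]
    return answer, votes / len(answers)


def build_preference_pairs(traces):
    """Build (chosen, rejected) reasoning pairs from one debate.

    traces: list of (reasoning_text, final_answer) tuples, one per agent
            (a hypothetical format for this sketch).
    Reasoning that ends in the majority answer is treated as "chosen",
    reasoning that ends in any other answer as "rejected".
    """
    consensus, _ = majority_answer([answer for _, answer in traces])
    if consensus is None:
        return []
    chosen = [r for r, a in traces if a == consensus]
    rejected = [r for r, a in traces if a != consensus]
    # Every majority trace is paired with every minority trace.
    return list(product(chosen, rejected))
```

As a usage example, calling `build_preference_pairs` on three agent traces where two agents answer "4" and one answers "5" yields two pairs, each preferring one majority trace over the single minority trace. A unanimous debate yields no pairs, because there is no minority reasoning to contrast against.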