Keywords: multi-agent debate, MAD, large language models, test-time scaling, reasoning, safety
TL;DR: We analyze multi-agent debate (MAD) as a test-time scaling method, revealing when it helps or harms compared to self-agent approaches in mathematical reasoning and safety tasks.
Abstract: The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on solution-finding tasks (e.g., mathematical reasoning) and response-judging tasks (e.g., safety). Our study systematically examines the influence of task type, task difficulty, and agent diversity on MAD’s performance. Our key findings reveal that, for solution-finding tasks, MAD offers only limited advantages over self-agent scaling, even with diverse agents, although its effectiveness increases slightly as problem difficulty rises. Conversely, for response-judging tasks, especially safety-reasoning tasks, MAD’s collaborative refinement generally strengthens defense and judgment as more agents are added. Moreover, incorporating diverse agent configurations yields a more pronounced reduction in attack success rate, indicating that agent diversity is crucial for response-judging tasks, unlike in solution-finding tasks. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 15213