Keywords: multi-agent debate, MAD, large language models, test-time scaling, reasoning, safety
TL;DR: We analyze multi-agent debate (MAD) as a test-time scaling method, showing when it helps or hurts relative to self-agent approaches on mathematical reasoning and safety tasks.
Abstract: The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving.
These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models.
Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to single-agent methods, particularly under varying conditions, remains elusive.
This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities.
We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
Our study systematically examines the influence of task types, task difficulty, and agent diversity on MAD's performance.
Key findings reveal that, for mathematical reasoning, MAD provides limited advantages over self-agent scaling, even with diverse agents, though it becomes slightly more effective as problem difficulty increases.
Conversely, for safety tasks, MAD's collaborative refinement generally strengthens defense as more agents are added, and incorporating diverse agent configurations yields a more pronounced reduction in attack success.
We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 15213