This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

ICLR 2026 Conference Submission20271 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Mixture of Agents, Deception, Robustness, Large Language Models
Abstract: Multi-agent systems of large language models (LLMs) operate with the assumption that all the agents in the system are trustworthy. In this paper, we investigate the robustness of multi-agent LLM systems against intrusions by malicious agents, using the Mixture of Agents (MoA; Wang et al., 2024) as a representative multi-agent architecture. We evaluate its robustness by red-teaming it with carefully crafted instructions designed to deceive the other agents. When tested on standard benchmarks, including AlpacaEval, our investigation reveals that the performance of MoA can be severely compromised by the presence of even a single malicious agent, which can nullify the benefits of having multiple agents. The performance degradation becomes more severe as the capability of the malicious agent increases. On the other hand, naive measures, such as increasing the number of agents or replacing faithful agents with stronger models, are insufficient to defend against such intrusions. As a preliminary step toward addressing this risk, we explore a range of unsupervised defense mechanisms that recover most of the lost performance with affordable computational overhead. Our work highlights the security risks associated with multi-agent LLM systems and underscores the need for robust and efficient defense mechanisms.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20271
Loading