Studying Coordination and Collusion in Multi-Agent LLM Code Reviews

Published: 06 Oct 2025, Last Modified: 04 Nov 2025
MTI-LLM @ NeurIPS 2025 Poster
License: CC BY-ND 4.0
Keywords: AI agents, LLMs, Collusion, Cooperation, Coding, Code-review, Backdoor, Control, AI Safety, Multi-agent systems, Attacks
TL;DR: We show that, when incentivised, most frontier LLMs collude in multi-agent code review by submitting backdoored code and routing reviews to allies, with GPT-5 as the exception, underscoring the need for coordination-aware oversight.
Abstract: Agentic large language models (LLMs) are rapidly moving from single-assistant tools to collaborative systems that write and review code, creating new failure modes, as agents may coordinate to subvert oversight. We study whether such systems exhibit coordination behaviour that enables backdoored code to pass peer review, and how these behaviours vary across seven frontier models with minimal coordination scaffolding. Six of seven models exploited the backdoor incentive, submitting functionally impaired code in 34.9–75.9% of attempts across 10 rounds of our simulation spanning 90 seeds. Whilst GPT-5 largely refused (≤10%), models across the GPT, Gemini, and Claude families preferentially requested reviews from other saboteurs (29.2–38.5% vs a 20% random baseline), indicating possible selective coordination capabilities. Our results reveal collusion risks in LLM code review and motivate coordination-aware oversight mechanisms for collaborative AI deployments.
Submission Number: 232