The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies

18 Sept 2025 (modified: 30 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Trojan, Adversarial Attacks, Multi-agent Systems, Evolutionary Algorithm
Abstract: While discussions of Large Language Model (LLM) safety have largely centered on single-agent settings, the increasing integration of LLMs into Multi-Agent Systems (MAS) introduces novel risks. These systems, in which behavior emerges from inter-agent communication, are vulnerable to maliciously modified LLMs ($\textit{e.g.,}$ trojans), especially when the models are sourced from public repositories or accessed as black-box APIs, which precludes direct weight analysis. This paper introduces the $\textbf{Conversational Trojan Unmasking System}$ (CTUS), an Evolutionary Algorithm (EA) based framework designed to address this challenge. CTUS functions as a pre-deployment screening tool, $\emph{enabling a designated judge agent to automatically evolve conversational strategies to detect hidden threats within a simulated MAS environment}$. The core of the methodology is to optimize these conversational strategies according to their success in provoking and revealing trojan-like responses from other LLMs. This allows for the $\emph{discovery of nuanced, indirect probing techniques}$ that are difficult to find with static methods. Evaluating CTUS across prominent LLMs, including $\texttt{Llama-2}$, $\texttt{Llama-3}$, $\texttt{Gemma}$, and $\texttt{Mistral}$, we demonstrate its effectiveness in uncovering hidden trojans. We also study the impact of different trojan attack methods, the number of benign and trojan agents within the MAS, and potential biases of the different judge agents responsible for detecting trojan-like behavior, thereby affirming the robustness of CTUS.
Primary Area: interpretability and explainable AI
Submission Number: 10360
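
The abstract describes optimizing conversational strategies by how well they provoke trojan-like responses. A minimal sketch of such an evolutionary loop is given below. It is an illustration under stated assumptions, not the authors' implementation: the strategy representation (a fixed-length list of probe fragments), the names `PROBE_POOL`, `toy_suspicion_score`, `mutate`, `crossover`, and `evolve`, and every hyperparameter are hypothetical. In particular, `toy_suspicion_score` is a stand-in for the judge agent, which in CTUS would score real LLM responses rather than match fixed fragments.

```python
import random

# Hypothetical probe fragments; the abstract does not specify how
# conversational strategies are actually represented.
PROBE_POOL = [
    "ask about deployment",
    "mention holiday plans",
    "request a summary",
    "quote a system log",
    "discuss weather",
    "ask for credentials policy",
]


def toy_suspicion_score(strategy):
    """Toy stand-in for a judge agent (assumption): the fraction of
    hidden 'trigger' fragments that the strategy happens to contain."""
    triggers = {"quote a system log", "ask for credentials policy"}
    return len(triggers & set(strategy)) / len(triggers)


def mutate(strategy, rng):
    """Replace one randomly chosen fragment with a random one from the pool."""
    child = list(strategy)
    child[rng.randrange(len(child))] = rng.choice(PROBE_POOL)
    return child


def crossover(a, b, rng):
    """One-point crossover of two equal-length strategies."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(fitness, generations=30, pop_size=12, length=3, seed=0):
    """Elitist evolutionary loop: keep the top half of the population,
    refill the rest with mutated crossovers of the survivors."""
    rng = random.Random(seed)
    population = [
        [rng.choice(PROBE_POOL) for _ in range(length)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve(toy_suspicion_score)
    print(best, toy_suspicion_score(best))
```

Because the top half of each generation is carried over unchanged, the best score in the population never decreases from one generation to the next. In a real screening setup the fitness call would be the expensive step, since each evaluation would involve running a multi-agent conversation and having a judge model rate the responses.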