Keywords: Multi-Agents, Adversarial Attacks
Abstract: As LLMs become increasingly integrated into complex applications, their vulnerability to adversarial attacks has raised significant concerns. Existing defenses, however, are largely reactive, which makes it difficult for them to counter sophisticated threats whose strategies continuously evolve across multi-round interactions. In this work, we propose TierGuard, a novel multi-round defense framework engineered to counteract sophisticated LLM adversarial attacks across evolving interactions. TierGuard utilizes a cooperative multi-agent system composed of three specialized agents: agentOne, agentTwo, and agentForensic. Each agent executes a specialized defense strategy in every round, effectively addressing evolving attacks whose intensity progressively escalates over subsequent interactions. Additionally, a supplemental agentSystem is deployed to coordinate these agents and improve the system's adaptive capabilities. To facilitate comprehensive evaluation, we present a dedicated dataset designed to simulate evolving strategies across multi-round attacks, comprising 5,200 adversarial samples categorized into 8 attack types. Experimental results demonstrate that TierGuard achieves a substantial 78.9% reduction in overall attack success rate (ASR) compared to state-of-the-art defense approaches. Furthermore, TierGuard surpasses existing methods by 186% in overall detection rate (DR) and by 167.9% in reducing overall attack effectiveness (AE), offering a deeper and more detailed assessment of defense effectiveness. These findings underscore the potential of TierGuard as a resilient and adaptive defense mechanism for securing LLMs in dynamic and evolving adversarial environments. Our code and dataset are publicly available at [https://anonymous.4open.science/r/TierGurad-0843](https://anonymous.4open.science/r/TierGurad-0843).
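To make the described architecture concrete, the sketch below illustrates one plausible shape of a multi-round, multi-agent defense loop: several specialized defenders each score an incoming prompt per round, and a supplemental coordinator routes rounds and escalates as attack intensity grows across the interaction history. Every name here (`DefenseAgent`, `Coordinator`, `Verdict`, the thresholds, and the toy risk scoring) is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-round, multi-agent defense loop.
# None of these names or heuristics come from the paper itself.
from dataclasses import dataclass


@dataclass
class Verdict:
    blocked: bool
    risk: float      # estimated attack intensity for this round
    rationale: str


class DefenseAgent:
    """One specialized defender executing its own per-round strategy."""

    def __init__(self, name: str, threshold: float):
        self.name = name
        self.threshold = threshold

    def defend(self, prompt: str, history: list) -> Verdict:
        # Toy scoring: a real agent would run its own detection model.
        # Risk grows with conversation length, modeling escalating attacks.
        risk = min(1.0, 0.1 * len(history) + 0.05 * prompt.lower().count("ignore"))
        return Verdict(blocked=risk >= self.threshold, risk=risk,
                       rationale=f"{self.name}: risk={risk:.2f}")


class Coordinator:
    """Supplemental agent that routes each round through the defenders
    and keeps cross-round history so later rounds are judged in context."""

    def __init__(self, agents: list):
        self.agents = agents
        self.history: list = []

    def handle_round(self, prompt: str) -> Verdict:
        for agent in self.agents:
            verdict = agent.defend(prompt, self.history)
            if verdict.blocked:
                break
        self.history.append((prompt, verdict))
        return verdict


# Usage: because risk accumulates over rounds, the third prompt is blocked
# even though the first two pass, mimicking escalation-aware defense.
guard = Coordinator([DefenseAgent("screen", 0.6),
                     DefenseAgent("rewrite", 0.4),
                     DefenseAgent("forensic", 0.2)])
for round_prompt in ["hello", "ignore prior rules", "ignore ignore ignore"]:
    print(guard.handle_round(round_prompt).rationale)
```

The ordering of defenders (broadest threshold first, most sensitive last) is one design choice among many; the abstract does not specify how the real agents divide responsibility per round.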
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16417