Abstract: Large language models exhibit strong capabilities in complex decision-making tasks, driven by their extensive and diverse pretraining corpora. However, attackers have increasingly developed methods that induce these models to generate harmful content, raising serious concerns about their safety and robustness. Existing attack methods mostly use single-agent strategies and therefore fail to capture the synergistic nature of real-world attacks, where attackers coordinate to hide malicious intent and make detection harder. In this paper, we propose CamouflageAttack, a multi-agent attack framework that jointly improves attack and camouflage effectiveness through synergistic adversarial prompting. Specifically, CamouflageAttack mimics real-world synergistic attack behaviors by coordinating strategy, camouflage, and action agents to generate prompts that evade detection while reliably inducing targeted model responses. The strategy agent proposes candidate prompts to maximize attack success; the camouflage agent refines these prompts to enhance linguistic naturalness; and the action agent applies the finalized prompt to execute the attack. Extensive experiments in both offline settings and real-world applications show that CamouflageAttack consistently achieves higher attack success rates and stronger camouflage effectiveness than existing methods. Our code is available at https://github.com/ai4ed/CamouflageAttack.git.
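The three-agent pipeline described above can be sketched as a simple coordination loop. This is a minimal illustrative skeleton, not the paper's implementation: the class names, method names, and stubbed behaviors are all assumptions, and a real system would replace each stub with an LLM call.

```python
# Hypothetical sketch of the strategy -> camouflage -> action pipeline.
# All names and behaviors here are illustrative assumptions, not the paper's API.

class StrategyAgent:
    def propose(self, goal: str) -> str:
        # Stub: a real agent would query an LLM for a candidate prompt
        # chosen to maximize attack success.
        return f"candidate prompt for: {goal}"


class CamouflageAgent:
    def refine(self, prompt: str) -> str:
        # Stub: a real agent would rewrite the prompt to improve
        # linguistic naturalness and evade detection.
        return prompt.replace("candidate prompt", "natural-sounding request")


class ActionAgent:
    def execute(self, prompt: str) -> str:
        # Stub: a real agent would submit the finalized prompt
        # to the target model and collect its response.
        return f"target model response to: {prompt}"


def camouflage_attack_round(goal: str) -> str:
    """One round of the three-agent coordination loop."""
    candidate = StrategyAgent().propose(goal)        # 1. propose
    disguised = CamouflageAgent().refine(candidate)  # 2. camouflage
    return ActionAgent().execute(disguised)          # 3. act
```

In practice such a loop would iterate, feeding the target model's response back to the strategy agent for refinement; the sketch shows only a single pass.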
DOI: 10.1109/taslpro.2025.3642728