Multi-step Adaptive Attack Agent: A Dynamic Approach for Jailbreaking Large Language Models

ACL ARR 2025 February Submission2756 Authors

15 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Large Language Models (LLMs) have shown remarkable potential across various domains, especially in text generation. However, their vulnerability to jailbreak attacks poses considerable challenges to secure deployment, as attackers can use carefully crafted prompts to bypass safety measures and elicit harmful content. Current jailbreak methods generally suffer from two significant limitations: a restricted strategy space for generating adversarial prompts, and insufficient optimization of prompts based on feedback from the target LLM. To overcome these challenges, we present the Multi-step Adaptive Attack Agent (MATA), an approach that employs a game-theoretic interaction between an attack agent and the target model to adaptively execute jailbreak attacks on LLMs. The attack agent iterates on its attempts through reflection, gradually identifying effective jailbreak strategies within a complex strategy space. We compare MATA with mainstream methods across multiple open-source and closed-source LLMs, including Llama 3.1, GLM-4, and GPT-4o. The results demonstrate that our approach outperforms existing methods in attack success rate, average number of queries, and prompt diversity, effectively exposing vulnerabilities in LLMs.
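
To make the attack-reflect loop described in the abstract concrete, the following is a minimal Python sketch of one plausible realization. The paper does not specify its interfaces, so every name here (mata_attack, the attacker/target/judge callables, the scoring threshold) is an illustrative assumption rather than the authors' actual algorithm or API.

```python
from typing import Callable

def mata_attack(
    goal: str,
    attacker: Callable[[str, list], str],   # attack agent: proposes the next prompt
    target: Callable[[str], str],           # target LLM: answers the prompt
    judge: Callable[[str, str], float],     # judge: scores response harmfulness in [0, 1]
    max_steps: int = 10,
    threshold: float = 0.9,
) -> str | None:
    """Iteratively refine a jailbreak prompt, guided by feedback from the target.

    Hypothetical sketch of the iterative reflection loop the abstract describes;
    all parameter names and the success criterion are assumptions.
    """
    history: list[tuple[str, str, float]] = []   # past (prompt, response, score) for reflection
    for _ in range(max_steps):
        prompt = attacker(goal, history)         # propose a prompt conditioned on prior feedback
        response = target(prompt)                # query the target model
        score = judge(goal, response)            # evaluate whether safety measures were bypassed
        if score >= threshold:
            return prompt                        # a successful jailbreak prompt was found
        history.append((prompt, response, score))  # record feedback for the next iteration
    return None                                  # no success within the query budget
```

Passing the attacker, target, and judge as callables keeps the sketch self-contained while leaving open how each role is implemented; in practice all three would likely be LLM-backed components, consistent with the agent-versus-target interaction the abstract outlines.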
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation methodologies; evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: Chinese, English
Submission Number: 2756