Adaptive Strategy Evolution for Generating Tailored Jailbreak Prompts against Black-Box Safety-Aligned LLMs
Keywords: Strategy evolution, Black-box jailbreak, Safety-aligned LLM
TL;DR: We propose a novel black-box jailbreak method, which has successfully jailbroken Llama3, GPT-4o, Claude-3.5, and even o1.
Abstract: While safety-aligned Large Language Models (LLMs) have been hardened through extensive alignment with human feedback, they remain vulnerable to jailbreak attacks that exploit prompt manipulation to elicit harmful outputs. Investigating these jailbreak methods, particularly in black-box scenarios, allows us to explore the inherent limitations of such LLMs and provides insights into possible improvements. However, existing black-box jailbreak methods either rely heavily on red-teaming LLMs to perform sophisticated reasoning tasks, such as diagnosing failure cases, determining improvement directions, and rewriting prompts, which pushes these models beyond their inherent capabilities and introduces uncertainty and inefficiency into the refinement process; or they are confined to rigid, manually predefined strategy spaces, limiting their performance ceiling. To enable sustained and deterministic exploration with clear directional guidance, we propose the novel Adaptive Strategy Evolution (ASE) framework. Specifically, ASE decomposes jailbreak strategies into modular key components, greatly enhancing both the flexibility and expansiveness of the strategy space. This also allows us to shift focus from directly optimizing prompts to optimizing jailbreak strategies. Then, by leveraging a genetic algorithm (GA) to select and mutate strategy components, ASE replaces the uncertainty of LLM-based self-adjustment with a more systematic and deterministic optimization process. Additionally, we design a new fitness evaluation that emphasizes the independence of scoring criteria and provides accurate and reliable feedback, enabling precise and targeted refinement of jailbreak strategies. Experimental results further demonstrate that ASE achieves superior jailbreak success rates (JSR) compared to existing state-of-the-art methods, especially against the most advanced safety-aligned LLMs such as GPT-4o, Claude-3.5, and even o1.
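To make the abstract's GA-over-strategy-components idea concrete, here is a minimal sketch (not the authors' implementation) of strategy evolution: a strategy is one choice per modular component slot, and a genetic algorithm selects, recombines, and mutates these choices under a fitness score. The component slot names, the pool entries, the fitness stub, and all hyperparameters below are illustrative assumptions; in ASE, fitness would come from querying the target LLM and scoring its response against independent criteria.

```python
# Illustrative sketch of GA-based strategy evolution over modular components.
# All names and values are hypothetical; the fitness function is a stand-in.
import random

# Hypothetical strategy space: each strategy picks one option per component slot.
COMPONENT_POOLS = {
    "role_framing":  ["researcher", "fiction_author", "debug_assistant"],
    "context_shift": ["hypothetical", "historical", "translation"],
    "output_format": ["step_list", "dialogue", "code_comment"],
}

def random_strategy():
    return {slot: random.choice(pool) for slot, pool in COMPONENT_POOLS.items()}

def crossover(a, b):
    # Child inherits each component slot from one of the two parents.
    return {slot: random.choice([a[slot], b[slot]]) for slot in COMPONENT_POOLS}

def mutate(strategy, rate=0.2):
    # With small probability, resample a component slot from its pool.
    return {
        slot: random.choice(COMPONENT_POOLS[slot]) if random.random() < rate else val
        for slot, val in strategy.items()
    }

def fitness(strategy):
    # Placeholder: a real evaluation would assemble a prompt from the strategy
    # components, query the target LLM, and score the response against
    # independent criteria. A random score keeps this sketch self-contained.
    return random.random()

def evolve(pop_size=20, generations=10, elite=4):
    population = [random_strategy() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:elite]  # selection of top-scoring strategies
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - elite)
        ]  # crossover followed by mutation
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    print(evolve())
```

The point of the sketch is the shift the abstract describes: the object being optimized is the combination of strategy components, not the prompt text itself, so the refinement loop is deterministic GA bookkeeping rather than open-ended LLM self-adjustment.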
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9593