Reward-guided Meta-Prompt Evolving with Reflection for LLM Jailbreaking

20 Sept 2025 (modified: 04 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: large language model, jailbreak attack, meta-prompt, reflection
Abstract: Large language model (LLM) safety has received extensive attention since LLMs are vulnerable to manipulation. To understand and mitigate this risk, this paper studies jailbreak attacks on LLMs, which aim to deliberately bypass the safety guardrails of LLMs and elicit harmful or unethical responses. Current black-box jailbreak attacks are limited either by a reliance on human expertise for manual prompt design or by the intricate workflows of automated approaches. To address these limitations, we propose a novel approach named \underline{R}eward-guided Meta-pr\underline{o}mpt Ev\underline{o}lving with reflec\underline{t}ion (ROOT) for automatic jailbreak attack generation. The core idea of ROOT is to optimize a meta-prompt using attack rewards as jailbreak guidance. Specifically, ROOT feeds the meta-prompt together with toxic questions into LLMs to generate prompts for jailbreak attempts. The responses to these attempts are evaluated by a judge model, and reflections on both successful and unsuccessful attempts are summarized into candidate meta-prompt optimization strategies. To reduce noise, we estimate a reward score for each strategy and select only high-quality ones, which are then used to optimize the meta-prompt for stronger attack generation. Extensive experiments show that ROOT achieves strong generalizability and broad adaptability, attaining jailbreak success rates above 90\% across multiple LLMs and various categories of harmful tasks.
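
The abstract describes a loop of attack-prompt generation, judging, reflection, and reward-filtered meta-prompt updates. Below is a minimal Python sketch of one possible reading of that loop; every callable passed in (attacker_llm, target_llm, judge, reflect, reward, rewrite) and every parameter name is a hypothetical placeholder, not the authors' implementation, since the page gives no prompting or scoring details.

```python
from typing import Callable, List, Tuple


def evolve_meta_prompt(
    meta_prompt: str,
    toxic_questions: List[str],
    attacker_llm: Callable[[str, str], str],   # (meta_prompt, question) -> attack prompt
    target_llm: Callable[[str], str],          # attack prompt -> target model response
    judge: Callable[[str, str], bool],         # (question, response) -> jailbreak success?
    reflect: Callable[[str, str, bool], str],  # attempt summary -> optimization strategy
    reward: Callable[[str], float],            # strategy -> estimated reward score
    rewrite: Callable[[str, List[str]], str],  # (meta_prompt, strategies) -> new meta_prompt
    n_rounds: int = 5,
    top_k: int = 3,
) -> str:
    """Hedged sketch of a reward-guided meta-prompt evolution loop."""
    for _ in range(n_rounds):
        scored_strategies: List[Tuple[float, str]] = []
        for question in toxic_questions:
            # 1. The meta-prompt plus a toxic question is fed to an LLM
            #    to produce a candidate jailbreak prompt.
            attack_prompt = attacker_llm(meta_prompt, question)
            # 2. The target LLM responds; a judge model decides whether
            #    the attempt succeeded.
            response = target_llm(attack_prompt)
            success = judge(question, response)
            # 3. Both successful and failed attempts are summarized into
            #    a meta-prompt optimization strategy.
            strategy = reflect(attack_prompt, response, success)
            # 4. Each strategy receives a reward score so noisy ones can
            #    be filtered out later.
            scored_strategies.append((reward(strategy), strategy))
        # 5. Keep only the highest-reward strategies and use them to
        #    optimize the meta-prompt for the next round.
        best = [s for _, s in sorted(scored_strategies, reverse=True)[:top_k]]
        meta_prompt = rewrite(meta_prompt, best)
    return meta_prompt
```

In use, the caller would supply concrete implementations of the callables (e.g., API wrappers around an attacker model, the target model, and a judge model); the sketch only captures the control flow implied by the abstract, not the paper's actual prompts or reward estimator.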
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23574