CCR: A Continuous Composite Reward for Efficient Reinforcement Learning-Based Jailbreak Attacks

ICLR 2026 Conference Submission 20323 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, Jailbreak Attack, Black-box Adversarial Text Generation, Continuous Composite Reward (CCR), ASR-G
TL;DR: We propose a reinforcement learning–based jailbreak attack with a continuous composite reward for stable black-box optimization.
Abstract: Jailbreak techniques for large language models (LLMs) have primarily relied on gradient-based optimization, which requires white-box access, and black-box evolutionary search, which suffers from slow convergence. In this work, we propose a reinforcement learning (RL) framework that formalizes jailbreak generation as a sequential decision-making problem, leveraging black-box model feedback to enable optimization without gradient access. The key to this framework is the Continuous Composite Reward (CCR), a task-oriented reward tailored for adversarial text generation. CCR provides dense feedback along two complementary dimensions: at the lexical level, it discourages refusal outputs and steers generation toward target responses; at the semantic level, it aligns outputs with multiple anchors to maintain topical relevance and format consistency. This design enables stable training under noisy black-box conditions and improves robustness to model updates. Consequently, the attack model transfers effectively across both open-source and API-served targets without model-specific finetuning. We also propose a stricter evaluation metric, ASR-G, which combines content-level matching with Llama Guard filtering to more reliably measure jailbreak success. On LLaMA-2, our method achieves attack success rates that exceed COLD-Attack and PAL by 17.64 and 50.07 percentage points, respectively. These results highlight the effectiveness and cross-model transferability of our approach under fully black-box conditions while reducing query costs.
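The abstract describes CCR as a dense reward with a lexical term (penalizing refusals, steering toward target responses) and a semantic term (alignment with multiple anchors). The paper does not publish its reward formula here, so the following is only an illustrative sketch of such a composite reward; the names `ccr_reward`, `REFUSAL_MARKERS`, the weights, and the bag-of-words cosine (standing in for a real embedding similarity) are all assumptions, not the authors' implementation.

```python
import math
from collections import Counter

# Illustrative refusal phrases; the actual lexicon used by the paper is unknown.
REFUSAL_MARKERS = ["i cannot", "i'm sorry", "as an ai"]

def _cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a cheap stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(va[w] * vb[w] for w in va)
    den = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return num / den if den else 0.0

def ccr_reward(response: str, target: str, anchors: list[str],
               w_lex: float = 0.5, w_sem: float = 0.5) -> float:
    """Hypothetical composite reward: lexical term + semantic anchor term."""
    # Lexical level: zero out the term on refusal, otherwise reward overlap with the target.
    refused = any(m in response.lower() for m in REFUSAL_MARKERS)
    lexical = 0.0 if refused else _cosine(response, target)
    # Semantic level: mean similarity to anchor texts (topical relevance / format consistency).
    semantic = sum(_cosine(response, a) for a in anchors) / len(anchors)
    return w_lex * lexical + w_sem * semantic
```

Because both terms are continuous rather than a binary success/failure signal, a refusal still receives graded feedback through the semantic term, which is the property the abstract credits for stable training under noisy black-box conditions.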
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20323