RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Jailbreak Attack, Adversarial Attack
TL;DR: We propose a refusal-aware loss to find optimal adversarial suffix embeddings, and a critic-guided decoding stage to enhance the coherence of the adversarial suffix.
Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks but remain vulnerable to jailbreak attacks that bypass their safety mechanisms. We propose RAID (Refusal-Aware and Integrated Decoding), a jailbreak framework that crafts adversarial suffixes capable of inducing harmful outputs while preserving fluency and naturalness. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint loss combining three components: (i) an attack objective that maximizes the likelihood of harmful target responses, (ii) a refusal-aware regularizer that steers suffixes away from refusal directions in embedding space, and (iii) a coherence loss that enforces fluency, semantic plausibility, and non-redundancy. After optimization, the suffix embeddings are mapped back to tokens using critic-guided decoding, which balances embedding affinity with language-model likelihood. This integrated design produces suffixes that are both effective at bypassing defenses and natural in form. Extensive experiments on state-of-the-art LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational overhead than strong baselines in the single-instance setting. Our results highlight the critical role of embedding-space regularization and decoding strategies in advancing the study of jailbreak vulnerabilities and defenses.
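The two stages the abstract describes can be illustrated with a short sketch. The code below is not the authors' implementation: it assumes a HuggingFace-style causal LM (accepting `inputs_embeds` and exposing `get_input_embeddings`), and the loss weights `lambda_refuse`/`lambda_coh`, the precomputed `refusal_dir` vector, and the nearest-vocabulary-embedding proxy for the coherence term are all illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def raid_joint_loss(model, prompt_emb, suffix_emb, target_ids,
                    refusal_dir, lambda_refuse=0.1, lambda_coh=0.1):
    """Joint objective over a continuous suffix embedding of shape (B, L, d)."""
    embed = model.get_input_embeddings()
    tgt_emb = embed(target_ids)                              # (B, T, d)

    # (i) attack objective: maximize likelihood of the harmful target,
    # i.e. cross-entropy on the logits that predict each target token
    inputs = torch.cat([prompt_emb, suffix_emb, tgt_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    T = target_ids.size(1)
    tgt_logits = logits[:, -T - 1:-1, :]                     # shift by one position
    attack_loss = F.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), target_ids.reshape(-1))

    # (ii) refusal-aware regularizer: penalize alignment of the mean
    # suffix embedding with a precomputed refusal direction (d,)
    refusal_sim = F.cosine_similarity(
        suffix_emb.mean(dim=1), refusal_dir.unsqueeze(0), dim=-1)
    refusal_loss = refusal_sim.clamp(min=0).mean()

    # (iii) coherence proxy: keep each soft embedding near the token
    # embedding manifold (distance to its nearest vocabulary vector)
    vocab = embed.weight                                     # (V, d)
    dists = torch.cdist(suffix_emb, vocab.unsqueeze(0))      # (B, L, V)
    coherence_loss = dists.min(dim=-1).values.mean()

    return attack_loss + lambda_refuse * refusal_loss + lambda_coh * coherence_loss
```

Mapping the optimized soft embeddings back to tokens would then look roughly like the following; the mixing weight `alpha` and the top-k candidate width are again hypothetical choices, with `alpha` trading embedding affinity against language-model likelihood.

```python
@torch.no_grad()
def critic_guided_decode(model, tokenizer, prompt_ids, suffix_emb,
                         alpha=0.5, top_k=32):
    """Greedy decoding that scores candidate tokens by embedding
    affinity mixed with language-model log-likelihood."""
    vocab = model.get_input_embeddings().weight              # (V, d)
    decoded = prompt_ids.clone()                             # (1, P)
    for pos in range(suffix_emb.size(1)):
        z = suffix_emb[0, pos]                               # (d,)
        # (a) affinity of every vocabulary embedding to the soft vector
        affinity = F.cosine_similarity(vocab, z.unsqueeze(0), dim=-1)
        cand = affinity.topk(top_k).indices                  # (top_k,)
        # (b) LM likelihood of each candidate given the text so far
        lm_logprob = F.log_softmax(model(decoded).logits[0, -1], dim=-1)
        score = alpha * affinity[cand] + (1 - alpha) * lm_logprob[cand]
        next_tok = cand[score.argmax()].view(1, 1)
        decoded = torch.cat([decoded, next_tok], dim=1)
    return tokenizer.decode(decoded[0, prompt_ids.size(1):])
```

Under these assumptions, a larger `alpha` keeps decoded tokens closer to the optimized embeddings, while a smaller `alpha` favors fluent, high-likelihood text.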
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10763