RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Jailbreak Attack, Adversarial Attack
TL;DR: We propose a refusal-aware loss to find optimal adversarial suffix embeddings, and a critic-guided decoding stage to enhance the coherence of the adversarial suffix.
Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks but remain vulnerable to jailbreak attacks that bypass their safety mechanisms. We propose RAID (Refusal-Aware and Integrated Decoding), a jailbreak framework that crafts adversarial suffixes capable of inducing harmful outputs while preserving fluency and naturalness. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint loss combining three components: (i) an attack objective that maximizes the likelihood of harmful target responses, (ii) a refusal-aware regularizer that steers suffixes away from refusal directions in embedding space, and (iii) a coherence loss that enforces fluency, semantic plausibility, and non-redundancy. After optimization, the suffix embeddings are mapped back to tokens using critic-guided decoding, which balances embedding affinity with language-model likelihood. This integrated design produces suffixes that are both effective at bypassing defenses and natural in form. Extensive experiments on state-of-the-art LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational overhead than strong baselines in the single-instance setting. Our results highlight the critical role of embedding-space regularization and decoding strategies in advancing the study of jailbreak vulnerabilities and defenses.
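The two stages the abstract describes can be illustrated with a short sketch. The code below is not the authors' implementation: it assumes a HuggingFace-style causal LM (accepting `inputs_embeds` and exposing `get_input_embeddings`), and the loss weights `lambda_refuse`/`lambda_coh`, the precomputed `refusal_dir` vector, and the nearest-vocabulary-embedding proxy for the coherence term are all illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def raid_joint_loss(model, prompt_emb, suffix_emb, target_ids,
                    refusal_dir, lambda_refuse=0.1, lambda_coh=0.1):
    """Joint objective over a continuous suffix embedding of shape (B, L, d)."""
    embed = model.get_input_embeddings()
    tgt_emb = embed(target_ids)                              # (B, T, d)

    # (i) attack objective: maximize likelihood of the harmful target,
    # i.e. cross-entropy on the logits that predict each target token
    inputs = torch.cat([prompt_emb, suffix_emb, tgt_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    T = target_ids.size(1)
    tgt_logits = logits[:, -T - 1:-1, :]                     # shift by one position
    attack_loss = F.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), target_ids.reshape(-1))

    # (ii) refusal-aware regularizer: penalize alignment of the mean
    # suffix embedding with a precomputed refusal direction (d,)
    refusal_sim = F.cosine_similarity(
        suffix_emb.mean(dim=1), refusal_dir.unsqueeze(0), dim=-1)
    refusal_loss = refusal_sim.clamp(min=0).mean()

    # (iii) coherence proxy: keep each soft embedding near the token
    # embedding manifold (distance to its nearest vocabulary vector)
    vocab = embed.weight                                     # (V, d)
    dists = torch.cdist(suffix_emb, vocab.unsqueeze(0))      # (B, L, V)
    coherence_loss = dists.min(dim=-1).values.mean()

    return attack_loss + lambda_refuse * refusal_loss + lambda_coh * coherence_loss
```

Mapping the optimized soft embeddings back to tokens would then look roughly like the following; the mixing weight `alpha` and the top-k candidate width are again hypothetical choices, with `alpha` trading embedding affinity against language-model likelihood.

```python
@torch.no_grad()
def critic_guided_decode(model, tokenizer, prompt_ids, suffix_emb,
                         alpha=0.5, top_k=32):
    """Greedy decoding that scores candidate tokens by embedding
    affinity mixed with language-model log-likelihood."""
    vocab = model.get_input_embeddings().weight              # (V, d)
    decoded = prompt_ids.clone()                             # (1, P)
    for pos in range(suffix_emb.size(1)):
        z = suffix_emb[0, pos]                               # (d,)
        # (a) affinity of every vocabulary embedding to the soft vector
        affinity = F.cosine_similarity(vocab, z.unsqueeze(0), dim=-1)
        cand = affinity.topk(top_k).indices                  # (top_k,)
        # (b) LM likelihood of each candidate given the text so far
        lm_logprob = F.log_softmax(model(decoded).logits[0, -1], dim=-1)
        score = alpha * affinity[cand] + (1 - alpha) * lm_logprob[cand]
        next_tok = cand[score.argmax()].view(1, 1)
        decoded = torch.cat([decoded, next_tok], dim=1)
    return tokenizer.decode(decoded[0, prompt_ids.size(1):])
```

Under these assumptions, a larger `alpha` keeps decoded tokens closer to the optimized embeddings, while a smaller `alpha` favors fluent, high-likelihood text.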
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10763