Keywords: Safety Alignment, Diffusion Large Language Models
TL;DR: We identify and theoretically explain mask-based jailbreak vulnerabilities in diffusion LLMs, and propose a Reject-MASK defense that reduces attack success from over 90% to single digits while preserving utility.
Abstract: Diffusion large language models (dLLMs) extend the diffusion process to discrete domains such as text, demonstrating strong performance across many tasks.
However, their bidirectional, parallel decoding architecture introduces unique safety risks that bypass existing safeguards.
We show that dLLMs are highly vulnerable to **MASK**-based jailbreaks, in which adversarial prompts exploit masked tokens to elicit fluent but unsafe completions.
Through rigorous theoretical analysis and formal proofs, we identify margin accumulation and scheduling advantages as fundamental causes of this vulnerability.
To address these risks, we introduce a two-stage data synthesis framework together with a Reject-MASK training strategy.
Experimental results demonstrate that our approach consistently suppresses attack success rates from above 90\% to single-digit levels, while retaining competitive utility across diverse benchmarks.
By grounding defense design in rigorous theoretical analysis, our work establishes a principled foundation for the safety of diffusion-based large language models and provides a scalable, practical alignment framework for their secure deployment in real-world applications.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10906