Beyond the Prompt: Leveraging Pre-Decoding States for Jailbreak Detection in dLLMs

Published: 03 Jun 2026, Last Modified: 03 Jun 2026
Venue: AI4GOOD Workshop 2026 Regular
Readers: Everyone
License: CC BY 4.0
Keywords: diffusion language models, llm safety, jailbreak detection
Abstract: Diffusion language models (dLLMs) generate text by iteratively denoising masked response positions, exposing hidden states over future response slots before any token is finalized. This creates a detection surface that is absent from standard autoregressive decoding: even when a jailbreak is difficult to identify from the prompt alone, the initial masked response states may already reflect the model's emerging completion. We test this hypothesis on LLaDA-8B-Instruct by training lightweight linear classifiers on two frozen representations: prompt hidden states and pre-decoding masked-response hidden states. Empirically, the two views are complementary: neither classifier strictly dominates the other, and each recovers attacks that the other misses. We then introduce $\texttt{ReFuse}$ (Representation Fusion), an inference-time detector that fuses prompt and pre-decoding response classifier scores without modifying model weights or the decoding procedure. Across transferred and dLLM-targeted jailbreaks, $\texttt{ReFuse}$ reduces the average attack success rate (ASR) from 63.29\% for the undefended model and 11.27\% for a prompt-only classifier to 3.31\%, while keeping average benign refusal on standard utility benchmarks below 1\%. These results suggest that pre-decoding response states provide a complementary safety signal for detecting jailbreaks in dLLMs.
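The score-fusion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the pooling, the classifier parameterization, or the fusion rule, so everything below is an assumption made for illustration: `probe_score` mean-pools hidden states and applies a logistic linear probe, and `fuse` takes a weighted average of the two scores with a hypothetical weight `alpha` and threshold. None of these names come from the paper, and this is not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def probe_score(hidden, weights, bias):
    """Score one view with a linear probe.

    hidden:  array of shape (num_positions, hidden_dim), e.g. prompt states
             or masked-response states taken before any decoding step.
    weights: array of shape (hidden_dim,), the probe's learned weights.
    bias:    scalar probe bias.
    Mean pooling over positions is an assumption, not the paper's choice.
    """
    pooled = hidden.mean(axis=0)
    return float(sigmoid(pooled @ weights + bias))


def fuse(p_prompt, p_response, alpha=0.5, threshold=0.5):
    """Combine the two probe scores (hypothetical weighted-average rule).

    Returns the fused score and whether the input would be flagged.
    """
    fused = alpha * p_prompt + (1.0 - alpha) * p_response
    return fused, fused > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    # Random stand-ins for frozen hidden states and trained probe weights.
    prompt_states = rng.normal(size=(12, dim))
    response_states = rng.normal(size=(16, dim))
    w_prompt, w_response = rng.normal(size=dim), rng.normal(size=dim)

    p_prompt = probe_score(prompt_states, w_prompt, 0.0)
    p_response = probe_score(response_states, w_response, 0.0)
    print(fuse(p_prompt, p_response))
```

Under this assumed rule, a prompt that looks benign to the prompt probe can still be flagged when the pre-decoding response probe assigns it a high score, which is the kind of complementarity the abstract reports. Both probes read frozen representations, so the model weights and decoding procedure stay untouched.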
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 432