Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
License: CC BY 4.0
Keywords: Interpretability for AI Safety, Applications of interpretability
TL;DR: We study the refusal dynamics of autoregressive (AR) and diffusion language models (DLMs) during generation, and introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, a tool for interpretable analysis and efficient detection of safety failure cases.
Abstract: Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present an empirical study of step-wise refusal dynamics, examining the role of AR and diffusion sampling from a safety perspective. Our results strongly indicate that the sampling strategy (diffusion vs. AR) plays a central role in safety behavior, acting as a factor distinct from the underlying learned representations. To go beyond text-level analysis and provide interpretability, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which enables the analysis of safety failures (harmful generations), including cases of incomplete internal recovery that are not observable at the text level. We further show that SRI improves safety by enabling the construction of an inference-time jailbreak detector that generalizes to unseen attacks and achieves detection performance competitive with the state of the art, while requiring over 100× lower inference overhead than existing defenses.
Submission Number: 349