Early Guidance, Late Convergence: Hidden‑State Massive Values in Diffusion MLLMs

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: diffusion large language model, massive values, hidden states
Abstract: Diffusion multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive generation, offering multi-token updates for finer control and faster inference. However, their internal mechanisms remain poorly understood, especially regarding how hidden states evolve across network layers and iterative denoising steps. In this work, we present the first systematic investigation of a striking phenomenon in dMLLMs: a small fraction of hidden state activations become extraordinarily large and consistently appear across layers and timesteps. We refer to this phenomenon as massive values. Our analysis reveals that in later layers and final diffusion steps, massive values align closely with the model’s output semantics and confidence, directly influencing generation quality. In contrast, early layers and initial noisy steps produce massive values that are necessary to initiate generation and guide the global structure of content. Furthermore, using a sparse autoencoder to interpret hidden representations, we find that the evolution of these high-magnitude activations closely tracks the formation of output semantics. This indicates that the massive values are not just numerical outliers but are crucial drivers of the model’s semantic generation process. Overall, our findings shed new light on the inner workings of dMLLMs and suggest potential strategies to improve their reliability and performance.
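The abstract's central object, "massive values," can be illustrated with a minimal sketch: flagging hidden-state entries whose magnitude is far above the typical activation scale. The function name `find_massive_values`, the `ratio` threshold, and the toy tensor below are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def find_massive_values(hidden, ratio=50.0):
    """Return indices of outlier activations in a hidden-state tensor.

    hidden: array of shape (seq_len, d_model).
    ratio:  an entry is flagged if |value| exceeds `ratio` times the
            median absolute activation -- a simple illustrative
            criterion, not the paper's exact definition.
    """
    mags = np.abs(hidden)
    median_mag = np.median(mags)
    return np.argwhere(mags > ratio * median_mag)

# Toy example: small Gaussian activations plus two injected outliers,
# mimicking the rare high-magnitude entries the abstract describes.
rng = np.random.default_rng(0)
h = rng.normal(0.0, 1.0, size=(8, 16))
h[2, 5] = 400.0
h[6, 11] = -350.0
print(find_massive_values(h))
```

On this toy input, only the two injected entries cross the threshold, reflecting the abstract's claim that massive values are a small fraction of all activations.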
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8480