Keywords: dlm, diffusion-language-model, activation-dynamics
TL;DR: LLaDA-8B contains a super-outlier activation; removing it makes the model produce nonsense.
Abstract: Diffusion language models (DLMs) have emerged as competitive alternatives to autoregressive (AR) language models, yet their activation dynamics remain poorly understood.
We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier that persists across all token positions. Despite this apparent redundancy, pruning the outlier collapses the model into repetitive token loops.
Apart from this outlier, LLaDA-8B is more redundant than comparable AR models, and its redundancy is concentrated in earlier layers. This reverses the AR pattern, in which deeper layers are usually more redundant due to undertraining. Weight spectral analysis attributes this to relative overtraining of early layers, and a controlled 160M AR/DLM pre-training pair reproduces the pattern, isolating the diffusion objective as the cause.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 188