Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Massive Activations, Diffusion Transformers, Visual Detail Synthesis
Abstract: Massive Activations (MAs) are a well-documented phenomenon across Transformer architectures, and prior studies in both LLMs and ViTs have shown that they play a substantial role in shaping model behavior. However, the nature and function of MAs within Diffusion Transformers (DiTs) remain largely unexplored. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that massive activations occur across all spatial tokens and that their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MAs-driven, training-free self-guidance strategy that explicitly enhances local detail fidelity in DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. DG integrates seamlessly with Classifier-Free Guidance (CFG), enabling joint enhancement of detail fidelity and prompt alignment. Extensive experiments demonstrate that DG consistently improves local detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
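The abstract describes combining a detail-deficient "bad" model with the original network as a guidance signal, alongside CFG. A minimal sketch of how such a combined guidance rule might look is below; the function name, the scale parameters, and the way the detail-deficient prediction is obtained are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def guided_noise_prediction(eps_cond, eps_uncond, eps_detail_deficient,
                            cfg_scale=7.0, dg_scale=2.0):
    """Hypothetical combination of Classifier-Free Guidance (CFG) with a
    Detail-Guidance-style term. The DG term steers the prediction away
    from a 'detail-deficient' model (assumed here to come from a forward
    pass with massive activations disrupted)."""
    # Standard CFG: extrapolate from the unconditional prediction
    # toward the conditional one.
    cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # DG-style term: push away from the detail-deficient prediction
    # to encourage richer local detail.
    return cfg + dg_scale * (eps_cond - eps_detail_deficient)

# Toy example: random tensors stand in for the three model outputs.
rng = np.random.default_rng(0)
shape = (4, 64, 64)
eps_c, eps_u, eps_d = (rng.standard_normal(shape) for _ in range(3))
out = guided_noise_prediction(eps_c, eps_u, eps_d)
print(out.shape)
```

Note that with `dg_scale=0` the rule reduces to plain CFG, so the detail term is a strictly additive correction on top of the usual guidance.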
Primary Area: generative models
Submission Number: 5020