Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Published: 22 Sept 2025, Last Modified: 01 Dec 2025 · NeurIPS 2025 Workshop · CC BY 4.0
Keywords: Diffusion Transformer, Magnitude Preservation, Condition Modulation
Abstract: Denoising diffusion models exhibit remarkable generative capabilities but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work interprets magnitude preservation as encoding the known \emph{second moment} (unit variance) of the denoising target and asks whether this statistical prior can stabilize Diffusion Transformers (DiTs) with AdaLN conditioning. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations in place of the traditional scaling and shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8\%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN while requiring $\sim$5.4\% fewer parameters. This work provides insights into conditioning strategies and magnitude control. Implementation available at \href{https://github.com/ericbill21/map-dit}{https://github.com/ericbill21/map-dit}.
Submission Number: 27
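
To make the rotation-modulation idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; see the linked repository for that) of a conditioning layer that applies condition-dependent 2D rotations to channel pairs in place of AdaLN's scale and shift. The module name `RotationModulation`, the pairing of adjacent channels, and the single linear head predicting one angle per pair are assumptions made for illustration. The key property is that rotations are orthogonal, so they leave activation magnitudes unchanged, which is what the abstract connects to magnitude preservation.

```python
import torch
import torch.nn as nn


class RotationModulation(nn.Module):
    """Hypothetical sketch: condition activations with learned 2D rotations
    applied to channel pairs, instead of AdaLN's scale/shift. Rotation angles
    are predicted from the conditioning embedding (e.g. timestep/class)."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        assert dim % 2 == 0, "channels are rotated in pairs"
        # One rotation angle per channel pair, predicted from the condition.
        self.to_angles = nn.Linear(cond_dim, dim // 2)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        theta = self.to_angles(cond)             # (batch, dim // 2)
        cos = torch.cos(theta).unsqueeze(1)      # (batch, 1, dim // 2)
        sin = torch.sin(theta).unsqueeze(1)
        x1, x2 = x[..., 0::2], x[..., 1::2]      # split into channel pairs
        # Rotate each pair; norms are preserved exactly, unlike scaling,
        # so activation magnitudes stay controlled.
        y1 = cos * x1 - sin * x2
        y2 = sin * x1 + cos * x2
        out = torch.empty_like(x)
        out[..., 0::2] = y1
        out[..., 1::2] = y2
        return out
```

In a DiT block, such a layer would sit where AdaLN would normally apply its scale and shift, with the conditioning vector coming from the timestep (and optionally class) embedding; combining it with a learned scale, as the abstract describes, recovers the variant reported to be competitive with AdaLN.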