Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: diffusion transformers, high norm activations, registers, attention sinks
Abstract: Diffusion Transformers (DiTs) have recently replaced U-Net backbones as the dominant architecture in state-of-the-art text-to-image generative models, achieving remarkable visual fidelity. However, their internal mechanisms remain largely unexplored. In this work, we investigate the emergence of high-norm activations within DiTs: tokens with unusually large magnitudes that resemble the “outlier” tokens previously identified in Vision Transformers (ViTs). Through a systematic analysis of four DiT architectures, we find that only Flux-Schnell and PixArt-sigma exhibit such activations in the image stream, concentrated primarily in the central transformer layers. Using linear probes and qualitative ablations, we show that these activations encode global or semantic image information, while their removal has a negligible effect on the generation process. We refer to these tokens as sink registers, reflecting their passive, semantic role. Our findings highlight an architectural divergence between ViTs and DiTs and contribute to a deeper understanding of diffusion-based generative models.
Submission Number: 49