Keywords: Diffusion Models, Representation Learning, Image Generation, Self-supervised Learning
TL;DR: We show that established diffusion architectures can be enhanced by conditioning the decoding process on features they themselves learn. The mechanism is surprisingly simple to implement, yet it jointly improves FID and linear-probing accuracy.
Abstract: While diffusion models excel at image synthesis, useful representations have also been shown to emerge from generative pre-training, suggesting a path towards unified generative and discriminative learning. However, suboptimal semantic flow within current architectures can hinder this potential: the features encoding the richest high-level semantics are underutilized and diluted as they propagate through decoding layers, impeding the formation of an explicit semantic bottleneck.
To address this, we introduce *self-conditioning*, a lightweight mechanism that reshapes the model's layer-wise semantic hierarchy *without external guidance*. By aggregating intermediate features and rerouting them to guide subsequent decoding layers, our method concentrates high-level semantics, simultaneously strengthening global generative guidance and forming more discriminative representations (a minimal sketch follows the abstract).
This simple approach yields a dual improvement across pixel-space UNet and UViT models and latent-space DiT models with minimal overhead. Crucially, it creates an architectural semantic bridge that propagates discriminative improvements into generation and accommodates further techniques such as contrastive *self-distillation*.
Experiments show that our enhanced models, especially the self-conditioned DiT, are powerful dual learners that yield strong, transferable representations on image-level and dense classification tasks, surpassing various generative self-supervised models in linear probing while improving or maintaining high generation quality.
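To make the mechanism concrete, here is a minimal PyTorch sketch of *one way* such self-conditioning could be wired up: a semantically rich intermediate feature map is pooled into a global vector, which then modulates subsequent decoder activations. The module name `SelfConditioner`, the mean-pooling, and the FiLM-style scale/shift injection are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch of self-conditioning, assuming mean-pooled aggregation
# and FiLM-style injection into decoder blocks (both hypothetical choices;
# the paper's actual aggregation and rerouting may differ).
import torch
import torch.nn as nn


class SelfConditioner(nn.Module):
    """Aggregates intermediate features into a global semantic vector
    and uses it to modulate subsequent decoding layers."""

    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        # Small MLP that turns pooled features into a conditioning vector.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )
        # Produces per-channel scale and shift for a decoder layer.
        # For brevity, decoder channels are assumed to equal feat_dim.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)

    def aggregate(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) from a semantically rich intermediate layer.
        pooled = feats.mean(dim=(2, 3))  # global average pool -> (B, C)
        return self.proj(pooled)         # conditioning vector -> (B, cond_dim)

    def modulate(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Reroute the aggregated semantics into a decoder layer's activations.
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]


if __name__ == "__main__":
    sc = SelfConditioner(feat_dim=256, cond_dim=512)
    mid = torch.randn(2, 256, 8, 8)    # bottleneck features
    dec = torch.randn(2, 256, 16, 16)  # a decoder layer's activations
    cond = sc.aggregate(mid)
    out = sc.modulate(dec, cond)
    print(out.shape)  # torch.Size([2, 256, 16, 16])
```

In a forward pass, `aggregate` would be called once at the bottleneck and `modulate` in each subsequent decoder block; since the conditioner is only a small MLP plus a linear head, the added overhead stays minimal, consistent with the lightweight design the abstract describes.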
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16715