Keywords: Diffusion Models, Representation Learning, Image Generation, Self-supervised Learning
TL;DR: We show that established diffusion architectures can be enhanced by conditioning the decoding process on features the model learns itself. The mechanism is surprisingly simple to implement, yet jointly improves both FID and linear-probe accuracy.
Abstract: While diffusion models excel at image synthesis, their generative pre-training has been shown to yield useful representations, paving the way towards unified generative and discriminative learning. However, this potential is hindered by an architectural limitation: the model's intrinsic semantic information flow is sub-optimal. The features that encode the richest high-level semantics are often underutilized and diluted in the decoding layers, impeding the formation of a strong representation bottleneck.
To address this, we introduce *self-conditioning*, a lightweight mechanism that reshapes the model's layer-wise semantic hierarchy *without external guidance*. By aggregating the richest intermediate features and rerouting them to guide its own decoding layers, our method concentrates high-level semantics, concurrently strengthening generative guidance and forming a more discriminative representation.
Results are compelling: this approach consistently improves both generation quality and representation quality across models and architectures with minimal overhead. Crucially, it creates an architectural semantic bridge that enables effective integration of other discriminative techniques, such as contrastive *self-distillation*, to further amplify gains. Extensive experiments show that our enhanced models, particularly pixel-space UViT and latent-space DiT, become powerful unified learners, surpassing various self-supervised models in linear evaluation while also improving or maintaining high generation quality.
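To make the described mechanism more concrete, below is a minimal sketch of how self-conditioning could be wired into a transformer-based diffusion decoder (e.g., a DiT/UViT-style block): intermediate features are pooled into a summary vector and injected into decoding layers as an additive, FiLM-like conditioning signal. All names (`SelfConditionedDecoderBlock`, `aggregate_intermediate`, `cond_proj`) and the exact wiring are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def aggregate_intermediate(features: list[torch.Tensor]) -> torch.Tensor:
    """Pool selected intermediate layers into one (B, dim) summary vector.

    Each entry is (B, N, dim); we average over tokens, then over layers.
    This is one simple aggregation choice, assumed for illustration.
    """
    pooled = [f.mean(dim=1) for f in features]        # (B, N, dim) -> (B, dim)
    return torch.stack(pooled, dim=0).mean(dim=0)     # average across layers


class SelfConditionedDecoderBlock(nn.Module):
    """Decoder block additionally conditioned on the model's own pooled features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Projects the aggregated self-feature into a per-channel shift (FiLM-like).
        self.cond_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, self_feat: torch.Tensor) -> torch.Tensor:
        # x:         (B, N, dim) decoder tokens
        # self_feat: (B, dim)    pooled intermediate features from earlier layers
        x = x + self.cond_proj(self_feat).unsqueeze(1)  # inject self-conditioning
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    # Smoke test with toy shapes.
    B, N, D = 2, 16, 64
    block = SelfConditionedDecoderBlock(D, num_heads=4)
    mids = [torch.randn(B, N, D) for _ in range(3)]    # stand-in intermediate features
    out = block(torch.randn(B, N, D), aggregate_intermediate(mids))
    print(out.shape)  # torch.Size([2, 16, 64])
```

The additive injection here is only one plausible design; a cross-attention or adaptive-norm (AdaLN-style) conditioning path would serve the same purpose of routing the model's own semantic summary into its decoding layers.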
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16715