UniRA: Unified Representation Alignment for Diffusion Models via Local, Structural, and Global Constraints
Keywords: Diffusion models, Representation alignment, Generative modeling, Semantic fidelity, Structural consistency
Abstract: Diffusion models have achieved remarkable advances in generative modeling, enabling compelling visual content generation. Yet their conventional training objective focuses solely on predicting the added noise, with no explicit consideration of how intermediate features are learned. This narrow focus can yield redundant representations that capture limited semantics and poor structural detail, leading to suboptimal performance. To address this, this paper proposes a unified representation alignment (UniRA) paradigm that augments the diffusion objective with explicit constraints on intermediate features. Specifically, UniRA enforces three complementary forms of alignment: local semantic fidelity for discriminative patch-level features, structural consistency to preserve relational organization, and global coherence to match overall feature distributions with real data. Extensive results on the challenging ImageNet and text-to-image benchmarks show that UniRA consistently improves convergence speed and synthesis quality, achieving better FID and precision/recall scores than the compared baselines under the same compute budget. Moreover, ablation analyses demonstrate the efficacy of UniRA in reducing feature redundancy, strengthening semantic information, and improving structural organization, thereby promoting high-quality synthesis.
Primary Area: generative models
Submission Number: 8649
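Illustrative sketch (not from the paper): the abstract names three alignment terms but gives no formulas, so the NumPy code below is only one plausible reading. Every function name, the choice of cosine similarity for the local term, pairwise similarity matrices for the structural term, mean/covariance matching for the global term, and the weights w_local, w_struct, w_global are assumptions made for illustration. Here h stands for intermediate diffusion features and z for reference features of the same shape (patches x channels).

```python
import numpy as np

def _normalize(x, eps=1e-8):
    # Row-wise L2 normalization; eps avoids division by zero.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def local_loss(h, z):
    # Assumed local term: 1 - cosine similarity, averaged over patches.
    return float(np.mean(1.0 - np.sum(_normalize(h) * _normalize(z), axis=-1)))

def structural_loss(h, z):
    # Assumed structural term: compare patch-to-patch similarity matrices.
    hn, zn = _normalize(h), _normalize(z)
    return float(np.mean((hn @ hn.T - zn @ zn.T) ** 2))

def global_loss(h, z):
    # Assumed global term: match feature means and covariances.
    mean_gap = np.sum((h.mean(axis=0) - z.mean(axis=0)) ** 2)
    cov_gap = np.sum((np.cov(h, rowvar=False) - np.cov(z, rowvar=False)) ** 2)
    return float(mean_gap + cov_gap)

def unira_style_objective(denoise_loss, h, z,
                          w_local=1.0, w_struct=1.0, w_global=1.0):
    # Hypothetical combined objective: the usual noise-prediction loss
    # plus the three weighted alignment terms.
    return (denoise_loss
            + w_local * local_loss(h, z)
            + w_struct * structural_loss(h, z)
            + w_global * global_loss(h, z))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(16, 8))   # 16 patches, 8 channels (toy sizes)
    z = rng.normal(size=(16, 8))
    print(unira_style_objective(0.5, h, z))
```

In this sketch the alignment terms vanish when h and z coincide, so the objective reduces to the plain denoising loss; how the actual method weights or schedules its terms is not stated in the abstract.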