Keywords: Diffusion
Abstract: Generative models face a fundamental challenge: they must simultaneously learn high-level semantic concepts (what to generate) and low-level synthesis details (how to generate it).
Conventional end-to-end training entangles these distinct and often conflicting objectives, leading to a complex and inefficient optimization process.
We argue that explicitly decoupling these tasks is key to unlocking more effective and efficient generative modeling.
To this end, we propose Embedded Representation Warmup (ERW), a principled two-phase training framework.
The first phase is dedicated to building a robust semantic foundation by aligning the early layers of a diffusion model with a powerful pretrained encoder.
This provides a strong representational prior, allowing the second phase (full generative training with an alignment loss that continues to refine the representation) to focus its resources on high-fidelity synthesis.
Our analysis confirms that this efficacy stems from functionally specializing the model's early layers for representation.
Empirically, our framework reaches FID=1.41 within 350 epochs, an 11.5$\times$ training speedup over single-phase methods such as REPA.
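To make the two-phase procedure concrete, below is a minimal training-loop sketch. It assumes PyTorch, a diffusion model that exposes `early_layers` and `late_layers`, a frozen pretrained encoder, and a REPA-style cosine alignment loss; all names, hyperparameters, and the data format are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of ERW-style two-phase training (names are assumptions).
import torch
import torch.nn.functional as F

def alignment_loss(h, target):
    # Negative cosine similarity between intermediate diffusion features
    # and frozen pretrained-encoder features (REPA-style alignment).
    return -F.cosine_similarity(h, target, dim=-1).mean()

def train_erw(model, encoder, loader, warmup_steps, total_steps, lam=0.5):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    step = 0
    for x, t, noise in loader:              # noisy input, timestep, denoising target
        with torch.no_grad():
            target = encoder(x)              # semantic target features (encoder frozen)
        h = model.early_layers(x, t)         # intermediate representation
        if step < warmup_steps:
            # Phase 1: representation warmup -- only the early layers receive
            # gradients, since the loss depends solely on their output.
            loss = alignment_loss(h, target)
        else:
            # Phase 2: full generative training with the alignment regularizer.
            pred = model.late_layers(h, t)
            loss = F.mse_loss(pred, noise) + lam * alignment_loss(h, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
        if step >= total_steps:
            break
```

The intent of the sketch is only to show the split: phase 1 optimizes the early layers against the frozen encoder to build the semantic foundation, and phase 2 switches to the generative objective while keeping the alignment term as a regularizer.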
Primary Area: generative models
Submission Number: 494