Why Are Distribution-Matching Distilled Students Lazy? Understanding the Copying Behavior in Few-Step Distillation
Keywords: Diffusion Models, Distillation, Noise-data Pairings, Memorization
TL;DR: This paper investigates why single-step Distribution Matching Distillation (DMD) students spontaneously reproduce the teacher's exact noise-to-data mappings, despite being trained only to match the final data distribution.
Abstract: Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning the student's noised distributions with the teacher's across all noise scales. In principle, such distribution-level supervision is agnostic to the teacher's specific noise-data pairings, which leaves the student free to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the teacher's original noise-data pairings, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of GAN objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.
Submission Number: 71