Keywords: Diffusion Models, Mixture of Experts, Efficient Foundation Model
TL;DR: We enhance diffusion models by distilling them into multiple students, yielding (a) higher sample quality through specialization on subsets of the data and (b) lower latency by distilling into smaller architectures, while retaining 1-step generation.
Abstract: Foundation models that generate photorealistic video with text or image guidance promise compelling augmented‑reality (AR) experiences, yet their prohibitive test‑time compute prevents true real‑time deployment. We focus on the dominant diffusion family and show that Multi-Student Distillation (MSD) increases effective model capacity while holding constant, or even reducing, latency, memory footprint, and energy per sample. MSD partitions the conditioning space and trains a lightweight one‑step generator per partition, allowing (i) higher sample quality at fixed latency and (ii) smaller per‑student backbones that fit edge and low-latency budgets.
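A minimal sketch of how the conditioning-space routing described above could look at inference time, assuming a KMeans partition over prompt embeddings and one callable one-step student per partition; all names and the clustering choice are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

class MultiStudentSampler:
    """Illustrative router: one lightweight one-step generator per partition."""

    def __init__(self, students, prompt_embeddings, n_partitions=4):
        # Partition the conditioning space, e.g. by clustering prompt embeddings
        # (an assumed partitioning scheme, chosen here for simplicity).
        self.router = KMeans(n_clusters=n_partitions, n_init=10).fit(prompt_embeddings)
        self.students = students  # list of per-partition one-step generators

    def generate(self, prompt_embedding, noise):
        # Route the condition to its partition, then run that student once
        # (single forward pass, i.e. 1-step generation).
        k = int(self.router.predict(prompt_embedding[None, :])[0])
        return self.students[k](prompt_embedding, noise)
```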
Submission Number: 11