Abstract: We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPMs). Specifically, we match the first- and second-order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We find that moment matching is sufficient to obtain samples of equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 trained on the COYO700M dataset. Our results suggest that the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by the FID and IS metrics. For example, on ImageNet 256x256 with 10 sampling steps, we achieve an FID of 6.94 and an IS of 207.85 with a GMM kernel, compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We observe improvements using both 1-rectified flow and 2-rectified flow models. Code will be made available.
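The moment-matching idea in the abstract can be illustrated with a minimal, self-contained sketch. This is not the paper's exact construction (the function names, the 1-D setting, and the fixed offset/weight choices are illustrative assumptions): given a target Gaussian $\mathcal{N}(\mu, \sigma^2)$, we build a $K$-component GMM whose component means are $\mu + \delta_k$ with weighted-zero-mean offsets, and shrink the shared component variance so the mixture's first two moments match the target.

```python
import numpy as np

def moment_matched_gmm(mu, sigma2, deltas, weights):
    """Constrain GMM parameters so the mixture matches N(mu, sigma2)
    in mean and variance (illustrative 1-D version)."""
    deltas = np.asarray(deltas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Enforce the zero-mean constraint on the offsets: sum_k w_k * delta_k = 0,
    # so the mixture mean equals mu.
    deltas = deltas - np.dot(weights, deltas)
    # Mixture variance = shared component variance + variance of component means;
    # shrink the component variance to compensate for the spread of the means.
    comp_var = sigma2 - np.dot(weights, deltas**2)
    assert comp_var > 0, "offsets too large for the target variance"
    return mu + deltas, comp_var, weights

def sample_gmm(means, comp_var, weights, n, rng):
    """Draw n samples: pick a component, then sample its Gaussian."""
    ks = rng.choice(len(weights), size=n, p=weights)
    return means[ks] + np.sqrt(comp_var) * rng.standard_normal(n)

rng = np.random.default_rng(0)
means, comp_var, w = moment_matched_gmm(
    mu=1.0, sigma2=4.0, deltas=[-1.0, 0.0, 1.0], weights=[0.25, 0.5, 0.25]
)
x = sample_gmm(means, comp_var, w, 200_000, rng)
print(x.mean(), x.var())  # both close to the target mean 1.0 and variance 4.0
```

The same bookkeeping carries over to the multivariate case used in the DDIM reverse kernel, where the offsets and covariances are vector- and matrix-valued.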
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank all three reviewers for their time and valuable feedback. The concerns they raised helped us identify important gaps in the experimental evaluation and presentation of the paper, and addressing them has meaningfully improved the quality of the manuscript. We are encouraged that the reviewers recognize the principled nature of the proposed moment-matching framework, its training-free applicability to pre-trained diffusion and rectified flow models, and its strong empirical performance, particularly in the low-sampling-step regime.
In response to the raised concerns, we have made the following changes to the manuscript:
1. **Notation**: We have changed the time variable notation from $t$ to $\tau$ in the Rectified Flow Matching section (Sec. 4) to clearly distinguish between the continuous-time RFM setting and the discrete-time diffusion setting used elsewhere in the paper.
2. **Forward process GMM form**: The GMM form of the forward marginals has been explicitly written out in Appendix A.3.
3. **Ablation studies**: Appendix A.8 previously reported ablations on the number of mixture components $K$ and offset scaling factor $s$ only for the DDIM-GMM-ORTHO-VUB variant. We have extended these ablations to all three
variants (DDIM-GMM-RAND, DDIM-GMM-ORTHO, and DDIM-GMM-ORTHO-VUB), including an additional scale value of $s=20$.
4. **Computational overhead**: We have added wall-clock timing results to Appendix A.10, quantifying the per-sample sampling latency and initialization overhead of each DDIM-GMM variant relative to DDIM on an NVIDIA Tesla V100 GPU. The per-sample overhead of the DDIM-GMM samplers relative to DDIM is small (3--8%).
5. **Typo fix**: The typo in Section 3.1 ($\delta_k^t$ corrected to $\delta_t^k$) has been fixed.
6. **Code release**: We will be open-sourcing our DDIM-GMM implementation, with the goal of making it directly usable within widely adopted frameworks such as Stable Diffusion and Diffusers. We have included the code as a zip file in the supplementary material. This will allow the community to reproduce our results, benchmark against our approach, and build upon it.
We also thank the Action Editor for their time and helpful guidance throughout the review process.
Assigned Action Editor: ~Valentin_De_Bortoli1
Submission Number: 5610