TL;DR: We conduct a theoretical analysis of current discrete diffusion models and propose a method to effectively capture the element-wise dependencies that conventional models ignore.
Abstract: Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language), mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that can capture dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can approximate the data distribution well, but essentially require *many sampling steps* to do so. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. The code used in the paper is available at https://github.com/sony/di4c.
Lay Summary: Many modern machine learning systems generate images or text by slowly refining a noisy input over many small steps. However, this slow process makes it difficult to use these models in real-time applications. Moreover, many existing methods ignore the natural dependencies among parts of the data (for example, relationships between neighboring image pixels), which becomes a bigger issue when fewer sampling steps are used.
We introduce a new method called Di4C that “distills” a well-trained but slow discrete diffusion model into a faster version. By using a mixture model, Di4C captures the hidden relationships (or correlations) between different parts of the data, and a specially designed loss function teaches the fast model to closely mimic the original multi-step process even when using just a few steps.
This work allows for much quicker generation of high-quality images and text, making advanced machine learning tools more practical and accessible for real-world applications.
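To illustrate why element-wise independence loses dimensional correlation, and why a mixture of factorized distributions can recover it, here is a minimal toy sketch. This is not the paper's actual Di4C model or loss; it is a hypothetical two-token example where each token takes values in {0, 1} and the target distribution is perfectly correlated.

```python
import numpy as np

# Target: perfectly correlated 2-token distribution over {0,1}^2,
# with half the mass on (0,0) and half on (1,1).
target = {(0, 0): 0.5, (1, 1): 0.5}

# Factorized (element-wise independent) model: p(x1, x2) = p1(x1) * p2(x2).
# The best it can do is match the marginals (each uniform), which
# leaks mass onto the impossible outcomes (0,1) and (1,0).
p1 = np.array([0.5, 0.5])
p2 = np.array([0.5, 0.5])
factorized = {(a, b): p1[a] * p2[b] for a in range(2) for b in range(2)}

# Mixture of two factorized components: each component is still a
# product distribution, but the mixture as a whole is correlated.
# Component 0 puts all mass on (0,0); component 1 on (1,1).
weights = np.array([0.5, 0.5])
components = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # concentrated on (0,0)
    (np.array([0.0, 1.0]), np.array([0.0, 1.0])),  # concentrated on (1,1)
]
mixture = {
    (a, b): sum(w * q1[a] * q2[b] for w, (q1, q2) in zip(weights, components))
    for a in range(2) for b in range(2)
}

print(factorized[(0, 0)])  # 0.25 -- correlation lost
print(mixture[(0, 0)])     # 0.5  -- correlation captured exactly
```

In a few-step sampler this difference matters: a factorized model must spread its mistakes across independent dimensions at every step, while a mixture can place joint mass where the data actually lives.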
Link To Code: https://github.com/sony/di4c
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: diffusion model, discrete diffusion, distillation, consistency model, dimensional correlation, convergence analysis
Submission Number: 2396