TL;DR: We propose a new discrete diffusion ELBO that enables combining masking and uniform noise, which unlocks self-correction capabilities without explicit training.
Abstract: While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise tokens. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) processes that offers greater flexibility in the design of the noising process. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach that combines masking and uniform noise, leading to improved sample quality and unlocking the model's ability to correct its own mistakes, an area where autoregressive models have notoriously struggled.
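For intuition, a minimal sketch of a forward noising step that interpolates between the clean token, a mask token, and uniform noise might look as follows (function and parameter names such as `noisy_sample`, `alpha_t`, and `p_uniform` are illustrative placeholders, not the paper's API):

```python
import torch
import torch.nn.functional as F

def noisy_sample(x, alpha_t, mask_id, vocab_size, p_uniform):
    """Draw z_t ~ Cat(alpha_t * x + (1 - alpha_t) * pi_t), where pi_t puts
    probability p_uniform on uniform noise and the rest on the mask token.
    (Illustrative sketch only; names and the schedule are placeholders.)"""
    beta_t = 1.0 - alpha_t
    x_onehot = F.one_hot(x, vocab_size).float()      # clean tokens as one-hot
    pi = torch.full((vocab_size,), p_uniform / vocab_size)
    pi[mask_id] += 1.0 - p_uniform                   # remaining mass on [MASK]
    probs = alpha_t * x_onehot + beta_t * pi         # marginal q_t(z_t | x)
    return torch.distributions.Categorical(probs=probs).sample()

# e.g. tokens = torch.randint(0, 999, (8, 128)); z_t = noisy_sample(tokens, 0.7, 999, 1000, 0.2)
```

In this toy setup, `p_uniform = 0` recovers pure masked diffusion, while a nonzero value mixes in uniform noise as described in the abstract.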
Lay Summary: Modern language models, like those powering chatbots and writing assistants, typically generate text one word (or token) at a time, predicting the next word based on what came before. This method works well but has a key limitation: once a word is written, the model can't go back to fix it, even if it realizes later that it made a mistake.
One promising direction of research to address this is called discrete diffusion, where the model generates text by starting from pure noise and gradually removing it, e.g. by filling in missing words, over multiple steps until a complete sentence emerges. However, the most popular diffusion method, called masked diffusion, still can't revise earlier word choices effectively.
In this work, we introduce a more flexible version of diffusion called General Interpolating Discrete Diffusion (GIDD). GIDD allows the model to better control how it refines text, making it possible to fix earlier errors. We also develop a new technique for training these models that helps them perform as well as leading discrete diffusion models.
By combining different types of noise, we show that GIDD not only generates higher-quality text, but also learns how to revise and improve its outputs, something today’s models often struggle with. This brings us a step closer to AI systems that can think and write more like humans: drafting, revising, and improving as they go.
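As a rough illustration of this drafting-and-revising idea, the sketch below shows a generic iterative denoising loop in which already-committed tokens can still be overwritten in later steps; the model interface and update schedule are hypothetical and do not reflect the paper's exact sampler:

```python
import torch

@torch.no_grad()
def generate(model, seq_len, mask_id, num_steps=32):
    """Illustrative denoising loop: start from a fully masked sequence and
    commit the model's predictions over several steps. Because committed
    positions can still be overwritten later, earlier choices can be revised.
    (Hypothetical sketch, not the paper's sampler.)"""
    z = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(z)                 # (1, seq_len, vocab_size)
        pred = logits.argmax(dim=-1)      # current best guess per position
        frac = (step + 1) / num_steps     # update more positions over time
        update = torch.rand(1, seq_len) < frac
        z = torch.where(update, pred, z)  # revised wherever `update` is True
    return z
```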
Link To Code: https://github.com/dvruette/gidd/
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: diffusion models, discrete diffusion, language models
Submission Number: 4815