Keywords: Diffusion Model, Language Generation
TL;DR: Continuous diffusion language model with exact manifold geometry modeling.
Abstract: We formulate a powerful generative framework for language, premised upon modeling discrete token generation as continuous trajectories of a Gaussian stochastic process in a Euclidean space. Specifically, to address the challenge posed by the high-dimensional, discrete nature of language data, we devise two core components: a projection function to embed discrete tokens into a continuous domain and a metric function to infer the conditional probability distribution of subsequent tokens from continuous embeddings. Subsequently, we employ a forward diffusion process that incrementally perturbs the data distribution towards a tractable standard Gaussian prior. To learn the generative reverse process, we formulate a novel \textit{data geometry-aware score} that explicitly exploits the inherent manifold structure of the discrete language data to refine the fidelity of the score approximation. Since direct optimization of the score function is intractable, we propose optimizing a tractable surrogate objective, the Relaxed Evidence Lower Bound (RELBO), which ensures a bounded approximation error via continuous relaxation. Finally, we critically reassess conventional evaluation protocols and introduce a novel comprehension score, designed to enable a more robust and equitable performance comparison against competing architectures. Empirical validation on the LM1B and OpenWebText benchmarks corroborates the effectiveness of our proposed framework. Our model significantly outperforms state-of-the-art (SOTA) discrete diffusion and autoregressive (AR) schemes, indicating a promising direction for language modeling.
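A minimal sketch of the pipeline the abstract describes: discrete tokens are projected into a continuous Euclidean space, a forward diffusion process perturbs the embeddings toward a standard Gaussian prior, and a metric function maps continuous points back to a distribution over tokens. All names, the variance-preserving noise schedule, and the squared-distance metric below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 100, 16
# Projection function (assumed): a learned lookup table embedding each
# discrete token into continuous Euclidean space.
embedding = rng.normal(size=(vocab_size, embed_dim))

def forward_diffuse(x0, t, T=1000):
    """Perturb clean embeddings x0 toward N(0, I) at step t.

    Uses a standard variance-preserving schedule (an assumption here):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

def metric_logits(x, embedding):
    """Metric function (assumed): score tokens by negative squared
    Euclidean distance between a continuous point and each embedding."""
    d2 = ((x[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)
    return -d2  # higher = closer embedding = more probable token

tokens = rng.integers(0, vocab_size, size=8)
x0 = embedding[tokens]                 # project tokens into the continuum
xT = forward_diffuse(x0, t=999)        # near-Gaussian at the final step
logits = metric_logits(x0, embedding)  # clean embeddings recover the tokens
assert (logits.argmax(axis=1) == tokens).all()
```

The reverse (generative) process would denoise samples from the Gaussian prior back toward the embedding manifold and decode them through the metric function; training that denoiser with the geometry-aware score and the RELBO objective is the paper's contribution and is not reproduced here.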
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4414