Faster Inference for Conditional Masked Diffusion Language Models by Knowledge Distillation of Guidance and Trajectory

Published: 30 May 2026, Last Modified: 01 Jun 2026
Venue: SPIGM @ ICML (Poster)
License: CC BY 4.0
Keywords: Diffusion Language Models, Diffusion Distillation, Conditional Generation
TL;DR: The paper proposes a distillation framework for masked diffusion language models that reduces the sampling inefficiencies of classifier-free guidance and multi-step denoising, achieving higher-quality conditional generation in few-step sampling regimes.
Abstract: Masked diffusion language models (MDLMs) offer a promising paradigm for natural language generation, enabling fully parallel, non-autoregressive decoding through iterative unmasking. However, existing MDLMs incur substantial inference costs due to the large number of neural network function evaluations required, thereby limiting their practical applicability in conditional sequence-to-sequence generation settings. In this paper, we propose a two-stage distillation framework for conditional MDLMs that aims to distill knowledge of both (i) classifier-free guidance and (ii) the denoising trajectory from a teacher MDLM to a student MDLM. As a result, the student model can, during inference, (i) replace the two forward passes required by classifier-free guided outputs with a single pass, and (ii) significantly reduce the number of denoising steps. Evaluations across diverse conditional generation tasks on MDLMs up to 8 billion parameters demonstrate improved generation quality with significantly fewer function evaluations.
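The cost argument in the abstract can be illustrated with a small sketch. It assumes the commonly used classifier-free guidance combination of conditional and unconditional logits; the abstract does not state the paper's exact formulation. The function names, the guidance scale, and the step counts (64 and 8) are hypothetical values chosen for illustration, not figures from the paper.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Assumed CFG rule: start at the unconditional logits and move
    toward the conditional logits by a factor of `scale`."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

def num_function_evals(steps, guided):
    """Forward passes needed: two per denoising step with CFG
    (conditional + unconditional), one per step without."""
    return steps * (2 if guided else 1)

# Toy logits over a 3-token vocabulary (illustrative values only).
cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([1.0, 0.0, 0.0])
guided = cfg_logits(cond, uncond, scale=2.0)

# Hypothetical budgets: a guided teacher with many steps versus a
# distilled student that needs one pass per step and fewer steps.
teacher_nfe = num_function_evals(steps=64, guided=True)
student_nfe = num_function_evals(steps=8, guided=False)
print(guided, teacher_nfe, student_nfe)
```

Under these assumed numbers, the teacher needs 128 forward passes and the student 8. The two savings multiply: removing the second guidance pass halves the per-step cost, and trajectory distillation reduces the number of steps.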
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 120