Dual Distillation of Trajectory and Guidance Knowledge for Faster Inference in Conditional Masked Diffusion Language Models
Keywords: Diffusion Language Models, Knowledge Distillation, sequence-to-sequence NLP, non-autoregressive generation
TL;DR: We propose a two-stage distillation approach for conditional masked diffusion language models (MDLMs) on sequence-to-sequence NLP tasks that tackles the sampling inefficiency of computing classifier-free guided, multi-step denoising outputs from MDLMs.
Abstract: Masked diffusion language models (MDLMs) have emerged as a promising generative framework for natural language, owing to their ability to generate tokens in parallel, non-autoregressively, through iterative unmasking/denoising. However, typical MDLMs require a very large number of neural network function evaluations for effective inference, making them computationally expensive in many real-world NLP applications that rely on conditional sequence-to-sequence generation. In this work, we propose a two-stage distillation method for conditional MDLMs that distills both (i) classifier-free guidance and (ii) the unmasking trajectory from an existing teacher MDLM into a student MDLM. At inference time, the student MDLM thus (i) collapses the two forward passes per step required by the classifier-free guided (teacher) MDLM into a single pass, and (ii) drastically reduces the number of unmasking steps. Through this dual distillation of guidance and trajectory knowledge, our student MDLM achieves speedups of up to 16$\times$ while virtually retaining the teacher's generation quality.
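To make the guidance-distillation stage concrete, below is a minimal sketch of how a teacher's classifier-free guided prediction (two forward passes) could be distilled into a single conditional pass of the student. All names, signatures, and the KL-based objective here are illustrative assumptions for exposition, not the paper's actual implementation; the trajectory-distillation stage (matching multi-step teacher unmasking with fewer student steps) is not shown.

```python
import torch
import torch.nn.functional as F

def guided_logits(model, x_masked, cond, w):
    """Classifier-free guided prediction: two forward passes
    (conditional and unconditional) combined with guidance scale w.
    Assumes model(x, cond) returns per-token logits; cond=None drops the condition."""
    logits_cond = model(x_masked, cond)      # conditioned on the source sequence
    logits_uncond = model(x_masked, None)    # condition dropped
    return logits_uncond + w * (logits_cond - logits_uncond)

def guidance_distillation_loss(student, teacher, x_masked, cond, w):
    """Illustrative objective: KL between the teacher's guided distribution
    (two passes) and the student's single conditional pass."""
    with torch.no_grad():
        t_logits = guided_logits(teacher, x_masked, cond, w)
    s_logits = student(x_masked, cond)       # single forward pass at inference time
    return F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
```

In this sketch, the student is trained so that one conditional forward pass reproduces the teacher's guided output, which is what allows halving the per-step cost before trajectory distillation further cuts the number of unmasking steps.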
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24546