Keywords: diffusion language models, discrete diffusion, distillation
TL;DR: Distilling a pre-trained autoregressive language model into a diffusion-based language model with the proposed Target Concrete Score objective.
Abstract: Transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their autoregressive nature forces sequential token-by-token decoding, leading to inefficiencies during inference. Furthermore, autoregressive language models lack inherent self-correction abilities, which hinders their capacity to refine and improve generated content without relying on external prompting or retraining techniques. In contrast, diffusion-based models offer the advantage of fast parallel generation through iterative refinement, while leveraging bi-directional attention to utilize the full context at once. However, diffusion language models have yet to match the generation quality of their autoregressive counterparts. This motivates us to explore the possibility of distilling a pre-trained autoregressive (AR) language model (teacher) into a non-autoregressive diffusion (non-AR) language model (student), combining the best of both worlds. In this work, we present Target Concrete Score (TCS) distillation, a theoretically grounded framework that bridges the autoregressive and diffusion paradigms. TCS distillation is broadly applicable to both discrete and continuous diffusion models and to any pre-trained autoregressive teacher model. We propose techniques that make TCS distillation scalable and efficient for transformer-based models, and show how it can both improve pre-trained diffusion language models and train new models from scratch. Through comprehensive experiments on language modeling tasks, we demonstrate the effectiveness of the proposed methods.
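To make the AR-teacher-to-diffusion-student setup concrete, below is a minimal, hypothetical sketch of a distillation step for a masked (absorbing-state) discrete diffusion student supervised by a frozen autoregressive teacher. This is not the paper's Target Concrete Score objective; the HF-style model interfaces, the `mask_id` token, and the shifted per-position KL loss are illustrative assumptions only.

```python
# Illustrative sketch (assumptions, not the paper's TCS objective):
# a frozen AR teacher's next-token distributions supervise a masked-diffusion
# student's per-position denoising predictions via a KL term on masked positions.
import torch

def distillation_step(teacher, student, tokens, mask_id, optimizer):
    """One toy distillation update on a batch of token ids of shape (B, L).

    Assumes `teacher` and `student` are HF-style models returning `.logits`
    of shape (B, L, V); `teacher` is causal, `student` sees the full sequence.
    """
    B, L = tokens.shape
    device = tokens.device

    # Sample a corruption level per sequence and mask tokens accordingly.
    t = torch.rand(B, 1, device=device)
    is_masked = torch.rand(B, L, device=device) < t
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)

    # Teacher: left-to-right next-token distributions over the clean sequence.
    with torch.no_grad():
        teacher_probs = teacher(tokens).logits.softmax(dim=-1)      # (B, L, V)

    # Student: predicts the clean token at every position from the masked input.
    student_logp = student(noisy).logits.log_softmax(dim=-1)        # (B, L, V)

    # Shift the teacher by one so the distribution conditioned on tokens[:, :i]
    # supervises the student's prediction at position i; KL(teacher || student).
    target = teacher_probs[:, :-1]
    pred = student_logp[:, 1:]
    pos_kl = (target * (target.clamp_min(1e-8).log() - pred)).sum(-1)

    # Average over masked positions only.
    mask = is_masked[:, 1:].float()
    loss = (pos_kl * mask).sum() / mask.sum().clamp_min(1.0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```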
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 580