CD$^{4}$LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

ACL ARR 2026 January Submission 5857 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Diffusion Language Models, Non-Autoregressive Generation, Efficiency, Discrete-Space Consistency Distillation, Adaptive Decoding, Parallel Decoding, Trajectory Invariance
Abstract: Autoregressive large language models achieve strong results on many benchmarks, but their decoding remains latency-bound by the sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long-jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with a small number of function evaluations while preserving generation quality. To achieve this, we propose CD$^{4}$LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD$^{4}$LM matches the LLaDA baseline with a 5.18$\times$ wall-clock speedup; across code and math benchmarks, it pushes the accuracy-efficiency Pareto frontier, achieving a 3.62$\times$ mean speedup while improving average accuracy.
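The abstract describes CAD only at a high level. The sketch below is a minimal, hypothetical illustration of confidence-gated parallel unmasking: at each step, masked positions whose predicted-token confidence clears a threshold are committed in parallel, so easy spans resolve in few steps while hard spans receive more. All names here (`toy_denoiser`, `MASK_ID`, `CONF_THRESHOLD`, the single-token fallback rule) are assumptions for illustration, not the paper's actual decoder or model.

```python
import numpy as np

MASK_ID = 0           # hypothetical id of the mask token
VOCAB_SIZE = 32       # toy vocabulary size
CONF_THRESHOLD = 0.7  # confidence required to commit a token in one step


def toy_denoiser(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the distilled student: returns per-position probabilities
    over the vocabulary. A real DSCD student would be a trained network
    conditioned on the partially unmasked sequence."""
    logits = 5.0 * rng.normal(size=(tokens.shape[0], VOCAB_SIZE))
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def confidence_adaptive_decode(length: int = 16, max_steps: int = 8, seed: int = 0) -> np.ndarray:
    """Confidence-gated parallel unmasking (illustrative only)."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK_ID, dtype=np.int64)
    for step in range(max_steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break                        # every position has been committed
        probs = toy_denoiser(tokens, rng)
        probs[:, MASK_ID] = 0.0          # never predict the mask token itself
        conf = probs.max(axis=-1)        # per-position confidence
        preds = probs.argmax(axis=-1)    # per-position greedy prediction
        # Commit every masked position whose confidence clears the threshold;
        # if none does, commit the single most confident one to guarantee progress.
        commit = masked & (conf >= CONF_THRESHOLD)
        if not commit.any():
            commit = np.zeros_like(masked)
            commit[np.argmax(np.where(masked, conf, -np.inf))] = True
        tokens[commit] = preds[commit]
        print(f"step {step}: committed {int(commit.sum())} token(s)")
    return tokens


if __name__ == "__main__":
    print(confidence_adaptive_decode())
```

Raising `CONF_THRESHOLD` trades steps for reliability; how CAD actually sets this trade-off, and how it relies on the trajectory invariance learned by DSCD, is specified in the paper rather than in this sketch.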
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Language Modeling, NLP Applications
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 5857