Abstract: Diffusion language models (DLMs) have emerged as promising alternatives to autoregressive models (ARMs) due to their bidirectional attention and parallel decoding. However, their inference cost grows substantially as they scale. Layer skipping addresses this challenge by selectively omitting redundant layers. While dynamic layer-skipping approaches are effective for ARMs, they do not extend naturally to DLMs, whose parallel generation paradigm makes fine-grained token-level routing challenging. We propose DIALS, a novel dynamic layer-skipping framework for DLMs. DIALS places a lightweight router before each Transformer layer; the router aggregates the representations of masked tokens to make a unified, sequence-level decision on whether to skip or execute that layer. Evaluated on LLaDA-8B across six benchmarks, DIALS generally achieves a better FLOPs-accuracy trade-off than static and random layer-skipping baselines. On PIQA, it reduces inference FLOPs by 14.26% without any loss in accuracy. Our analysis further shows that the initial layers are consistently important. Moreover, by incorporating a scaling term based on the mask ratio into the routing objective, we reveal that inherent layer redundancy emerges as denoising progresses.
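The sequence-level routing mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (`LayerSkipRouter`, `maybe_apply_layer`), the mean-pooling of masked-token states, the linear gating head, and the skip threshold are all hypothetical choices for exposition.

```python
import torch
import torch.nn as nn


class LayerSkipRouter(nn.Module):
    """Hypothetical lightweight router placed before a Transformer layer.

    It mean-pools the hidden states of still-masked tokens and maps the
    pooled vector to a single skip/execute logit for the whole sequence.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)  # small linear head -> skip logit

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); mask: (batch, seq_len),
        # 1 at positions that are still masked at this denoising step.
        weights = mask.unsqueeze(-1).float()
        pooled = (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)
        return self.gate(pooled).squeeze(-1)  # one logit per sequence


def maybe_apply_layer(layer: nn.Module, router: LayerSkipRouter,
                      hidden: torch.Tensor, mask: torch.Tensor,
                      threshold: float = 0.0) -> torch.Tensor:
    # Execute the layer only if the router votes for it; otherwise the hidden
    # states pass through unchanged (the skip). Collapsing the batch with a
    # mean is a simplification; per-sequence routing would treat each example
    # separately.
    if router(hidden, mask).mean() > threshold:
        return layer(hidden)
    return hidden


# Toy usage: one Transformer layer guarded by its router.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
router = LayerSkipRouter(hidden_dim=64)
hidden = torch.randn(2, 16, 64)
mask = torch.randint(0, 2, (2, 16))
out = maybe_apply_layer(layer, router, hidden, mask)
```

In a full system, one such router would sit before every Transformer layer and be trained jointly with a routing objective (the paper additionally scales that objective by the mask ratio); the sketch above only shows the skip-or-execute decision at inference time.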
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mufan_Li1
Submission Number: 8491