Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

ICLR 2026 Conference Submission 774 Authors

02 Sept 2025 (modified: 23 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: diffusion language model, efficient, block
Abstract: Diffusion Language Models (DLMs) promise parallel generation via iterative denoising, yet their practical speed is often throttled by \emph{schedulers} that accept scattered high-confidence tokens, fragmenting KV caches and forcing repeated local repairs. We present \emph{Prefix Absorption}, a training-free inference principle operationalized by the \emph{Longest Stable Prefix} (LSP) scheduler. In each iteration, LSP performs a single forward pass to locate the longest left-aligned run whose predictions are both high-margin and temporally stable, then snaps the candidate boundary to natural structural delimiters (e.g., punctuation or code boundaries) before atomically committing the block. This prefix-first topology preserves a single frozen/active boundary, converts KV updates into contiguous appends, and concentrates attention on a rapidly shrinking suffix. As a consequence, the active sequence length decays geometrically and the total work bends from an effectively cubic $O(N^3)$ regime toward near-quadratic $O(N^2)$ while maintaining coherence. On code generation (HumanEval, MBPP) and complex reasoning (GSM8K, GPQA) with LLaDA-8B and Dream-7B, LSP substantially reduces end-to-end latency and denoiser calls while matching or improving task quality relative to strong scattered-acceptance baselines. Ablations isolate the gains to LSP’s core components—adaptive block sizing, structural boundary snapping, and the prefix-first commitment topology—demonstrating that faster DLM inference can be achieved without retraining and is complementary to existing diffusion schedules.
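The abstract describes the per-iteration LSP selection step (longest high-margin, temporally stable left-aligned run, followed by boundary snapping to structural delimiters). Below is a minimal sketch of that step, not the authors' implementation; the names `margin_tau`, `delimiter_ids`, `tokens_prev`, and `min_block` are illustrative assumptions rather than quantities defined in the paper.

```python
import numpy as np

def longest_stable_prefix(margins, tokens_now, tokens_prev, delimiter_ids,
                          margin_tau=0.9, min_block=1):
    """Return how many leftmost active positions to commit this iteration.

    margins      : (L,) per-position confidence margin (e.g., top-1 minus top-2 prob)
    tokens_now   : (L,) argmax token ids from the current denoising pass
    tokens_prev  : (L,) argmax token ids from the previous pass (for temporal stability)
    delimiter_ids: set of token ids treated as structural boundaries
                   (hypothetical choice, e.g., punctuation or newline ids)
    """
    L = len(margins)

    # 1) Longest left-aligned run that is both high-margin and unchanged
    #    since the previous pass.
    run = 0
    while run < L and margins[run] >= margin_tau and tokens_now[run] == tokens_prev[run]:
        run += 1
    if run < min_block:
        return 0  # nothing safe to freeze this iteration

    # 2) Snap the candidate boundary back to the last structural delimiter
    #    inside the run, so the committed block ends at a natural break.
    snapped = run
    for i in range(run - 1, -1, -1):
        if int(tokens_now[i]) in delimiter_ids:
            snapped = i + 1
            break
    return snapped if snapped >= min_block else run
```

Under this sketch, the first `snapped` tokens would be committed atomically (a contiguous KV append), and only the remaining suffix stays active in subsequent denoising passes, which is the mechanism the abstract credits for the geometric decay of the active length.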
Primary Area: generative models
Submission Number: 774