Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
Keywords: Diffusion Language Models; Continuous Bitstream Diffusion; Non-Autoregressive Language Generation; Entropy-Gated Stochastic Sampling; Scalable Vocabulary Modeling
Domains: Language and Learning
TL;DR: We model language as a continuous diffusion process over bitstreams and introduce an entropy-rate-gated stochastic sampler that substantially narrows the quality-diversity gap to autoregressive baselines.
External Link: https://arxiv.org/pdf/2605.07013
Abstract: Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting that continuous state spaces are highly effective for language modeling. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and uses a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we introduce a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity (Gen. PPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving Gen. PPL = 27.06 at an entropy of 5.26 using 4× fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck shared by standard DLMs. By predicting O(log V) bitwise logits through semantic bit-patching, our model reduces memory footprint and improves throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
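Illustrative sketch: the O(log V) claim in the abstract can be made concrete with a toy encoding. The code below maps a token ID to a fixed-width sequence of "analog" bits in {-1.0, +1.0} and decodes it by thresholding at zero. It is a minimal sketch under stated assumptions, not the paper's method: the plain binary index, the ±1 scaling, and the helper names `bit_width`, `token_to_analog_bits`, and `analog_bits_to_token` are all hypothetical, and the abstract's "semantic bit-patching" may assign bits quite differently.

```python
def bit_width(vocab_size: int) -> int:
    """Smallest number of bits that can index vocab_size distinct tokens."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_analog_bits(token_id: int, width: int) -> list[float]:
    """Encode a token ID as width values in {-1.0, +1.0}, most significant bit first.

    Assumption: plain binary index with +/-1 scaling, chosen only for illustration.
    """
    if not 0 <= token_id < (1 << width):
        raise ValueError("token_id does not fit in the given width")
    return [1.0 if (token_id >> (width - 1 - i)) & 1 else -1.0 for i in range(width)]


def analog_bits_to_token(bits: list[float]) -> int:
    """Decode (possibly noisy) analog bits by thresholding each value at zero."""
    token_id = 0
    for value in bits:
        token_id = (token_id << 1) | (1 if value > 0 else 0)
    return token_id


# Illustrative vocabulary size only (the GPT-2 tokenizer has 50,257 entries).
vocab_size = 50257
width = bit_width(vocab_size)

clean = token_to_analog_bits(1234, width)
# A mildly perturbed copy, standing in for a partially denoised state.
noisy = [v + (0.3 if i % 2 == 0 else -0.3) for i, v in enumerate(clean)]
recovered = analog_bits_to_token(noisy)
```

Under this toy encoding the output layer would need only `width` values per token position rather than one logit per vocabulary entry, which is the scaling contrast the abstract describes.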
Submission Number: 83