Efficient Parallel Samplers for Recurrent-Depth Models and Their Connections to Diffusion Language Models

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
NeurIPS 2025 ER Workshop Spotlight
License: CC BY 4.0
Keywords: Recurrent-Depth, Latent Reasoning, Efficiency, Diffusion Forcing, Parallelization, Inference
TL;DR: Text generation from recurrent-depth models can be significantly accelerated by treating them as diffusion models and using diffusion-forcing samplers.
Abstract: Language models with recurrent depth, also referred to as universal or looped transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent pretraining efforts have shown that these architectures scale to modern language modeling tasks while showing advantages in reasoning. In this work, we analyze the relationship of these architectures to language diffusion models. In doing so, we develop a new sampler for these models based on diffusion forcing that speeds up generation by a factor of around 5x. The sampler advances by decoding new tokens at every forward pass through the model while the latent states of these tokens are still being refined in parallel through recurrence. Interestingly, this sampler, based on principles from the diffusion literature, can be applied directly to existing 3.5B recurrent-depth transformers without any additional tuning. From this perspective, our findings not only provide an efficient way of parallelizing the extra compute required by models with recurrent depth at inference, but also imply that we can conceptualize existing models as strong continuous, albeit still causal, diffusion language models.
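
To make the sampling idea concrete, below is a minimal, self-contained sketch (not the authors' code or the released model's API) of how a diffusion-forcing-style diagonal sampler can be organized: each forward pass refines all in-flight token latents by one recurrence step in parallel, provisionally decodes the newest latent to seed the next position, and finalizes a latent once it has received the full recurrence depth. The names `embed`, `recurrence_step`, `decode_token`, `DEPTH`, and `HIDDEN` are hypothetical placeholders chosen for illustration.

```python
# Schematic sketch of a diffusion-forcing-style diagonal sampler for a
# recurrent-depth model. All functions and constants are toy placeholders
# (assumptions), not the actual model interface.

import numpy as np

HIDDEN = 16   # toy latent width
DEPTH = 8     # recurrence steps a latent receives before it is finalized
VOCAB = 4     # toy vocabulary size

def embed(token_id: int) -> np.ndarray:
    """Placeholder embedding: deterministic initial latent for a token id."""
    return np.random.default_rng(token_id).standard_normal(HIDDEN) * 0.1

def recurrence_step(latent: np.ndarray, context: list[np.ndarray]) -> np.ndarray:
    """Placeholder for one pass through the shared recurrent block.
    A real model would attend over the latents of earlier positions."""
    ctx = np.mean(context, axis=0) if context else np.zeros(HIDDEN)
    return np.tanh(latent + 0.1 * ctx)

def decode_token(latent: np.ndarray) -> int:
    """Placeholder output head: maps a (possibly partially refined) latent
    to a token id."""
    return int(np.argmax(latent[:VOCAB]))

def diagonal_sample(prompt_ids: list[int], num_new_tokens: int = 12) -> list[int]:
    """Diffusion-forcing-style sampling loop.

    Every forward pass refines all in-flight latents by one recurrence step
    in parallel, provisionally decodes the newest (still unconverged) latent
    to seed the next position, and finalizes the oldest latent once it has
    received DEPTH steps. Sequential latency is thus roughly one recurrence
    step per generated token instead of DEPTH steps per token."""
    finished = [embed(t) for t in prompt_ids]      # context latents, treated as converged
    window: list[tuple[np.ndarray, int]] = []      # (latent, steps_taken) per in-flight position
    output: list[int] = []

    # Seed the first new position from the last prompt token.
    window.append((embed(prompt_ids[-1]), 0))

    while len(output) < num_new_tokens:
        # One parallel forward pass: every in-flight latent advances one step,
        # attending causally to finished latents and earlier in-flight latents.
        old_latents = [z for z, _ in window]
        window = [
            (recurrence_step(z, finished + old_latents[:i]), n + 1)
            for i, (z, n) in enumerate(window)
        ]

        # Provisionally decode the newest latent so the next position can
        # already enter the pipeline while this latent keeps being refined.
        if len(output) + len(window) < num_new_tokens:
            provisional_tok = decode_token(window[-1][0])
            window.append((embed(provisional_tok), 0))

        # Finalize the oldest latent once it has been refined DEPTH times.
        if window and window[0][1] >= DEPTH:
            z_done, _ = window.pop(0)
            output.append(decode_token(z_done))
            finished.append(z_done)

    return output

if __name__ == "__main__":
    print(diagonal_sample([1, 2, 3]))
```

In steady state this loop emits roughly one token per forward pass, which is where the parallel speedup over running the full recurrence depth sequentially for each token before starting the next would come from.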
Submission Number: 169