Tiny Recursive Language Diffusion Models

ACL ARR 2026 January Submission10787 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Diffusion Language Models, masked diffusion, recursive reasoning, deep supervision, tiny models, denoising, algorithmic generalization, knowledge distillation
Abstract: Autoregressive large language models (ARMs) are effective but brittle on tasks where a single wrong token invalidates the full output and where iterative error correction is essential. In parallel, recent work shows that \emph{tiny} networks can perform strong recursive reasoning on hard puzzle-like sequence tasks via deep supervision and latent recursion \citep{jolicoeur2025less,wang2025hrm}. Separately, masked diffusion language models (MDMs) demonstrate that diffusion-based, non-autoregressive generation can scale and exhibit core language-model capabilities while enabling iterative refinement \citep{nie2025llada,austin2021structured,shi2024simplified,sahoo2024simple,ou2024absorbing}. We propose \textbf{TR-LDM}, a \emph{tiny recursive} masked diffusion language model that combines (i) a principled masked-diffusion likelihood surrogate \citep{nie2025llada} with (ii) a Tiny Recursive Model (TRM)-style latent reasoning state and recursive refinement \citep{jolicoeur2025less}. TR-LDM uses a single small network that alternates between latent-state updates (``reasoning'') and answer-state updates (``proposal''), and it can be trained either in standard one-step diffusion fashion or with TRM-style deep supervision over denoising iterations (TR-LDM-DS). To make the approach feasible on a single H100 within an hour for algorithmic benchmarks, we present a compute-constrained recipe: $\le$20M parameters, short fixed-length tokenizations, mixed precision, early halting, and optional teacher distillation from a pretrained diffusion LM or ARM teacher. We provide full algorithms for training and sampling, step-by-step implementation guidance, theoretical results connecting our loss to an upper bound on negative log-likelihood and justifying truncated credit assignment under contraction assumptions, and a complete experimental plan on Sudoku-Extreme and Maze-Hard \citep{wang2025hrm,jolicoeur2025less}.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Models, commonsense reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 10787
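
Illustrative sketch (not from the submission): the abstract refers to a masked-diffusion likelihood surrogate in the style of \citep{nie2025llada}. Assuming that surrogate takes the usual form, a $1/t$-weighted cross-entropy over masked positions, a one-sample estimate could look roughly like the code below. The names `MASK`, `mask_tokens`, and `masked_diffusion_loss` are hypothetical and chosen only for illustration. The model is replaced by a table of per-position log-probabilities, and the recursive latent-state updates of TR-LDM are not modeled here.

```python
import math
import random

MASK = -1  # hypothetical mask-token id, chosen only for this sketch


def mask_tokens(x0, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]


def masked_diffusion_loss(log_probs, x0, xt, t):
    """One-sample estimate of the 1/t-weighted masked cross-entropy.

    log_probs[i][v] stands in for the model's log-probability of token v at
    position i given the noised sequence xt. Only masked positions contribute.
    """
    total = 0.0
    for i, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy == MASK:
            total -= log_probs[i][clean]
    return total / t


# Usage: a uniform "model" over a 4-token vocabulary, masking level t = 0.5.
rng = random.Random(0)
x0 = [2, 0, 1, 3]
t = 0.5
xt = mask_tokens(x0, t, rng)
uniform = [[math.log(0.25)] * 4 for _ in x0]
loss = masked_diffusion_loss(uniform, x0, xt, t)
```

With a uniform predictor, each masked position contributes $\log 4$ before the $1/t$ weighting, so the estimate depends only on how many positions were masked. Averaging such estimates over $t$ and over maskings is what would give the training objective under the stated assumption.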