Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

ICLR 2026 Conference Submission 7997 Authors

16 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Depth Recurrence, Large Language Model, Looped Transformer, Latent Reasoning
TL;DR: We retrofit depth recurrence into pretrained transformer models.
Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to retrofit existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on grade-school math, we train on math data from Common Crawl and observe that retrofitting pretrained models to be depth-recurrent results in better performance at a given training compute budget than simply post-training the original non-recurrent language model. Further, we train our retrofitted recurrent models on a mixture of FineWeb-Edu and high-quality Nemotron general and math data. We observe that retrofitting can yield performant, general-purpose depth-recurrent language models that improve with test-time compute via scaled recurrence, outperforming static-depth post-trained baselines on a range of common benchmarks, including GSM8K, ARC, and PIQA.
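To make the core idea concrete, below is a minimal sketch of depth recurrence with a recurrence curriculum: a shared block of layers is looped a number of times that grows over training, so effective depth (and test-time compute) scales with the loop count. All names here (`RecurrentBlock`, `recurrence_curriculum`, the linear ramp schedule) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: loop a shared block of transformer layers r times (depth recurrence),
# ramping r up over training via a simple curriculum. Assumed/hypothetical code,
# not the paper's actual method.
import torch
import torch.nn as nn


class RecurrentBlock(nn.Module):
    """Applies a shared stack of layers `num_recurrences` times."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers  # e.g., layers lifted from a pretrained model

    def forward(self, hidden: torch.Tensor, num_recurrences: int) -> torch.Tensor:
        for _ in range(num_recurrences):  # effective depth = r * len(layers)
            for layer in self.layers:
                hidden = layer(hidden)
        return hidden


def recurrence_curriculum(step: int, total_steps: int, max_r: int) -> int:
    """Linearly ramp the recurrence count from 1 to max_r over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return 1 + int(frac * (max_r - 1))


# Toy usage: stand-in "layers" looped with an increasing recurrence count.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(2)]
)
block = RecurrentBlock(layers)
x = torch.randn(4, 16, 64)  # (batch, sequence, hidden)
for step in (0, 5_000, 10_000):
    r = recurrence_curriculum(step, total_steps=10_000, max_r=8)
    out = block(x, num_recurrences=r)
    print(step, r, out.shape)
```

The curriculum keeps early training cheap (few loop iterations) while still arriving at a model that can be unrolled for more iterations at test time, which is the compute-saving behavior the abstract describes; at inference, `num_recurrences` can simply be increased to spend more test-time compute.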
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7997