Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

ICLR 2026 Conference Submission 7997 Authors

16 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Depth Recurrence, Large Language Model, Looped Transformer, Latent Reasoning
TL;DR: We retrofit depth recurrence into pretrained transformer models.
Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to retrofit existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on grade-school math, we train on math data from Common Crawl and observe that retrofitting pretrained models to be depth-recurrent results in better performance at a given training compute budget than simply post-training the original non-recurrent language model. Further, we train our retrofitted recurrent models on a mixture of FineWeb-Edu and high-quality Nemotron general and math data. We observe that retrofitting can yield performant, general-purpose depth-recurrent language models that improve with test-time compute via scaled recurrence, outperforming static-depth post-trained baselines on a range of common benchmarks, including GSM8K, ARC, and PIQA.
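To make the core idea concrete, below is a minimal sketch of depth recurrence with a recurrence curriculum: a shared block of layers is looped a number of times that grows over training, so effective depth (and test-time compute) scales with the loop count. All names here (`RecurrentBlock`, `recurrence_curriculum`, the linear ramp schedule) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: loop a shared block of transformer layers r times (depth recurrence),
# ramping r up over training via a simple curriculum. Assumed/hypothetical code,
# not the paper's actual method.
import torch
import torch.nn as nn


class RecurrentBlock(nn.Module):
    """Applies a shared stack of layers `num_recurrences` times."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers  # e.g., layers lifted from a pretrained model

    def forward(self, hidden: torch.Tensor, num_recurrences: int) -> torch.Tensor:
        for _ in range(num_recurrences):  # effective depth = r * len(layers)
            for layer in self.layers:
                hidden = layer(hidden)
        return hidden


def recurrence_curriculum(step: int, total_steps: int, max_r: int) -> int:
    """Linearly ramp the recurrence count from 1 to max_r over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return 1 + int(frac * (max_r - 1))


# Toy usage: stand-in "layers" looped with an increasing recurrence count.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(2)]
)
block = RecurrentBlock(layers)
x = torch.randn(4, 16, 64)  # (batch, sequence, hidden)
for step in (0, 5_000, 10_000):
    r = recurrence_curriculum(step, total_steps=10_000, max_r=8)
    out = block(x, num_recurrences=r)
    print(step, r, out.shape)
```

The curriculum keeps early training cheap (few loop iterations) while still arriving at a model that can be unrolled for more iterations at test time, which is the compute-saving behavior the abstract describes; at inference, `num_recurrences` can simply be increased to spend more test-time compute.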
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7997