Keywords: Depth Recurrence, Large Language Model, Looped Transformer, Latent Reasoning, 9 pages
TL;DR: 9 Pages. We retrofit depth recurrence into pretrained transformer models.
Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute.
In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models.
We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost.
In our experiments on grade-school math, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
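A minimal sketch of the idea described above, assuming a PyTorch-style setup: a slice of pretrained transformer layers is reused in a loop to add depth recurrence, and a simple curriculum raises the recurrence count over training. All class and function names here (e.g. `DepthRecurrentBlock`, `recurrence_curriculum`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: depth recurrence retrofitted onto pretrained layers,
# with a curriculum that increases the number of loops during training.
import torch
import torch.nn as nn


class DepthRecurrentBlock(nn.Module):
    """Applies a slice of pretrained layers `n_loops` times to raise effective depth."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        # In practice these would be (a subset of) the pretrained model's layers;
        # real decoder layers also take attention masks, omitted here for brevity.
        self.layers = layers

    def forward(self, hidden_states: torch.Tensor, n_loops: int) -> torch.Tensor:
        for _ in range(n_loops):            # same weights applied repeatedly
            for layer in self.layers:
                hidden_states = layer(hidden_states)
        return hidden_states


def recurrence_curriculum(step: int, total_steps: int, max_loops: int) -> int:
    """Linearly increase the recurrence count from 1 to `max_loops` over training."""
    frac = step / max(total_steps, 1)
    return 1 + int(frac * (max_loops - 1))


# Hypothetical usage at each training step:
#   n_loops = recurrence_curriculum(step, total_steps, max_loops=4)
#   hidden = recurrent_block(hidden, n_loops=n_loops)
```

The curriculum shape (linear here) and the choice of which pretrained layers to loop are assumptions for illustration; the paper's actual schedule and layer partitioning may differ.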
Submission Number: 196