Do Depth-Grown Models Overcome The Curse Of Depth? An In-Depth Analysis

ICLR 2026 Conference Submission 20628 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: stacking, language model, reasoning, efficient training, depth analysis
TL;DR: We analyse how growing Transformers gradually in depth improves reasoning by boosting depth utilization and inducing modular computational circuits.
Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we show that growth via gradual middle stacking yields more effective utilization of model depth, changes in the residual stream structure, and the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements on downstream reasoning benchmarks. Overall, this work highlights how gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
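To make the idea of growth via gradual middle stacking concrete, here is a minimal, hypothetical sketch (not the submission's or MIDAS's actual implementation): at each growth stage, a slice of blocks around the middle of the stack is duplicated and reinserted at the midpoint, increasing depth while reusing trained weights. The function `grow_middle` and the growth schedule below are illustrative assumptions.

```python
import copy
import torch.nn as nn


def grow_middle(blocks: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """Grow a Transformer stack by copying a middle slice of blocks.

    Hypothetical sketch of "middle stacking": the new blocks are
    initialized as weight copies of the blocks just before the midpoint
    and inserted back at the middle, increasing depth by `num_new`.
    """
    depth = len(blocks)
    mid = depth // 2
    start = max(0, mid - num_new)
    # Deep-copy the selected middle blocks so the new layers start from
    # already-trained weights rather than random initialization.
    copies = [copy.deepcopy(b) for b in blocks[start:mid]]
    grown = list(blocks[:mid]) + copies + list(blocks[mid:])
    return nn.ModuleList(grown)


# Illustrative growth schedule: 6 -> 8 -> 12 -> 16 layers across training
# stages, with normal training between successive growth steps.
# stack = nn.ModuleList(TransformerBlock(...) for _ in range(6))
# for target_depth in (8, 12, 16):
#     # ... train the current stack for some number of steps ...
#     stack = grow_middle(stack, target_depth - len(stack))
```

The key design choice this sketch illustrates is that depth is added in the middle of the network rather than at the top, which is what the abstract's depth-wise analyses relate to the improved utilization of later layers.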
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20628