On the Residual Scaling of Looped Transformers: Stability and Transferability

Published: 02 Mar 2026, Last Modified: 18 Mar 2026
Venue: LIT Workshop @ ICLR 2026
Readers: Everyone
License: CC BY 4.0
Track: tiny / short paper (up to 5 pages)
Keywords: loop transformer, training stability
Abstract: Looped (weight-tied) Transformers increase effective depth by repeatedly applying a shared block for $L$ steps. In practice, a larger $L$ often improves capability, but it requires careful hyperparameter tuning. We study the parameterization of pre-norm looped Transformers and ask which residual scaling enables stable training and transferable hyperparameters across loop counts. In contrast to the common $1/\sqrt{L}$ scaling used in deep networks, our analysis of a simplified tied-weight residual MLP shows that looped models require $1/L$ residual scaling. We validate the theoretical predictions on a standard pre-norm Transformer architecture. Our experiments with looped LLMs across a range of loop counts and learning rates demonstrate that $1/L$ scaling offers significantly better stability and hyperparameter transfer than $1/\sqrt{L}$ scaling.
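As a rough illustration of the contrast drawn in the abstract, the sketch below applies one shared pre-norm residual update $x \leftarrow x + \alpha\, f(\mathrm{norm}(x))$ for $L$ loops and measures how far the residual stream drifts from its input. It is a toy under stated assumptions, not the paper's analysis or code: the shared block `tied_block` is a hypothetical identity map, chosen so that every loop pushes in the same direction, and `rms_norm`, `looped_forward`, and `relative_drift` are names introduced here for illustration only.

```python
import math

def rms_norm(x, eps=1e-6):
    """Root-mean-square normalization of a vector given as a list of floats."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def tied_block(h):
    """Hypothetical stand-in for the shared block: an identity map (assumption)."""
    return list(h)

def looped_forward(x, num_loops, alpha):
    """Apply the same pre-norm residual update num_loops times."""
    for _ in range(num_loops):
        update = tied_block(rms_norm(x))
        x = [xi + alpha * ui for xi, ui in zip(x, update)]
    return x

def relative_drift(num_loops, scaling):
    """Return ||x_L - x_0|| / ||x_0|| for scaling '1/L' or '1/sqrt(L)'."""
    if scaling == "1/L":
        alpha = 1.0 / num_loops
    elif scaling == "1/sqrt(L)":
        alpha = 1.0 / math.sqrt(num_loops)
    else:
        raise ValueError(f"unknown scaling: {scaling}")
    x0 = [1.0] * 8
    xL = looped_forward(x0, num_loops, alpha)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(xL, x0)))
    den = math.sqrt(sum(b * b for b in x0))
    return num / den

if __name__ == "__main__":
    for L in (4, 16, 64):
        print(L, relative_drift(L, "1/L"), relative_drift(L, "1/sqrt(L)"))
```

In this toy, the $L$ identical updates add up coherently, so the drift stays close to 1 under $1/L$ scaling for every loop count and grows roughly like $\sqrt{L}$ under $1/\sqrt{L}$ scaling. A real looped Transformer block is nonlinear and trained, so this only conveys the accumulation intuition, not the paper's results.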
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 93