Track: tiny / short paper (up to 5 pages)
Keywords: loop transformer, training stability
Abstract: Looped (weight-tied) Transformers increase effective depth by repeatedly applying a shared block for $L$ steps.
In practice, larger $L$ often improves capability, but requires careful hyperparameter tuning.
We study the parameterization of pre-norm looped Transformers and ask
which residual scaling enables stable training and transferable
hyperparameters across loop counts.
In contrast to the $1/\sqrt{L}$ residual scaling common in deep networks, our analysis of a simplified tied-weight residual MLP shows that looped models require $1/L$ residual scaling.
We validate the theoretical predictions on a standard pre-norm Transformer architecture.
Our experiments with looped LLMs across a range of loop counts and learning rates show that $1/L$ scaling offers substantially better training stability and hyperparameter transfer than $1/\sqrt{L}$ scaling.
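To make the two residual scalings concrete, here is a minimal pure-Python sketch of a pre-norm looped forward pass. It is an illustration under stated assumptions, not the paper's implementation: the update form $x \leftarrow x + \alpha \, f(\mathrm{Norm}(x))$, the toy `shared_block` (one tied weight matrix followed by `tanh`), and all function names are hypothetical stand-ins for a real Transformer block.

```python
import math


def residual_scale(num_loops, rule):
    """Return the residual multiplier alpha for a given loop count L."""
    if rule == "1/L":
        return 1.0 / num_loops
    if rule == "1/sqrt(L)":
        return 1.0 / math.sqrt(num_loops)
    raise ValueError(f"unknown rule: {rule}")


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def shared_block(h, weight):
    """Toy stand-in for the tied block: the same weight matrix at every step."""
    return [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weight]


def looped_forward(x, weight, num_loops, rule):
    """Apply the shared block num_loops times with a scaled pre-norm residual."""
    alpha = residual_scale(num_loops, rule)
    for _ in range(num_loops):
        update = shared_block(layer_norm(x), weight)  # pre-norm: normalize first
        x = [xi + alpha * ui for xi, ui in zip(x, update)]
    return x
```

In this toy, each block output is bounded by 1 in absolute value, so the total change per coordinate after $L$ steps is at most $\alpha L$: at most 1 under $1/L$ scaling, but up to $\sqrt{L}$ under $1/\sqrt{L}$ scaling. This bound is a property of the sketch only and is not a substitute for the paper's analysis.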
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 93