Keywords: AdamW, weight decay, Scion
TL;DR: Decoupled weight decay proportional to LR^2 results in stable weight & grad norms for both AdamW and Scion.
Abstract: Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate $\gamma$ without question. Some researchers have recently challenged this assumption and argued, based on orthogonality arguments at steady state, that decoupled weight decay should instead be set $\propto \gamma^2$. In contrast, we find that eliminating the contribution of the perpendicular component of the update to the weight norm changes the training dynamics very little. Instead, we derive that decoupled weight decay $\propto \gamma^2$ yields a stable weight norm from the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. As an example, we generalize our findings to constrained Scion and show that decoupled weight decay $\propto \gamma^2$ leads to stable weight and gradient norms and improved model performance.
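Illustrative steady-state sketch (a reconstruction under assumptions beyond what the abstract states, with notation introduced here for illustration): write the decoupled update as $w_{t+1} = (1-\delta)\,w_t - \gamma\,u_t$, where $\delta$ is the per-step weight-decay shrinkage and $u_t$ the optimizer update. If $u_t$ is independent of $w_t$ at steady state with $\mathbb{E}[w_t^\top u_t]=0$, then $\mathbb{E}\|w_{t+1}\|^2 = (1-\delta)^2\,\mathbb{E}\|w_t\|^2 + \gamma^2\,\mathbb{E}\|u_t\|^2$, whose fixed point satisfies $\mathbb{E}\|w\|^2 = \gamma^2\,\mathbb{E}\|u\|^2 / \bigl(1-(1-\delta)^2\bigr) \approx \gamma^2\,\mathbb{E}\|u\|^2 / (2\delta)$ for small $\delta$. Choosing $\delta \propto \gamma^2$ then makes the steady-state weight norm independent of $\gamma$.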
Primary Area: optimization
Submission Number: 21917