Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

Atli Kosson; Bettina Messmer; Martin Jaggi

Published: 07 Nov 2023, Last Modified: 13 Dec 2023, Venue: M3L 2023 Poster

Keywords: Learning dynamics of deep neural networks, weight decay, normalization, effective learning rate, AdamW, Adam with L2 regularization, SGDM, spherical motion dynamics, optimization, training, scale invariance, equilibrium, rotational

TL;DR: Weight decay can benefit neural network optimization by balancing the effective learning rate in the form of the average angular update across different layers and neurons, especially when combined with proper normalization.

Abstract: Weight decay can significantly impact the optimization dynamics of deep neural networks. In certain situations the effects of weight decay and gradient updates on the magnitude of a parameter vector cancel out on average, forming a state known as equilibrium. This causes the expected rotation of the vector in each update to remain constant along with its magnitude. Importantly, equilibrium can arise independently for the weight vectors of different layers and neurons. These equilibria are highly homogeneous for some optimizer and normalization configurations, effectively balancing the average rotation—a proxy for the effective learning rate—across network components. In this work we explore the equilibrium states of multiple optimizers including AdamW and SGD with momentum, providing insights into interactions between the learning rate, weight decay, initialization, normalization and learning rate schedule. We show how rotational equilibrium can be enforced throughout training, eliminating the chaotic transient phase corresponding to the transition towards equilibrium, thus simplifying the training dynamics. Finally, we show that rotational behavior may play a key role in the effectiveness of AdamW compared to Adam with L2-regularization, the performance of different normalization layers, and the need for learning rate warmup.
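
The equilibrium described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code. It assumes plain SGD (no momentum) with weight decay on a single weight vector, and an idealized gradient that is random, orthogonal to the weights, and has norm inversely proportional to the weight norm, as would hold for a scale-invariant (normalized) layer. The function name and the parameters `grad_scale`, `dim`, and `init_norm` are hypothetical. Under these assumptions the fixed point of the norm update gives a per-step rotation of approximately sqrt(2 * lr * wd) when lr * wd is small, regardless of the initial norm.

```python
import numpy as np

def simulate_angular_updates(lr, wd, init_norm, steps=10000, dim=64,
                             grad_scale=1.0, seed=0):
    """Toy model: SGD + weight decay on one scale-invariant weight vector.

    Assumption (not from the source): the gradient is a random direction
    orthogonal to w with norm grad_scale / ||w||.
    Returns the per-step rotation angles (radians) and the final norm.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w *= init_norm / np.linalg.norm(w)
    angles = np.empty(steps)
    for t in range(steps):
        g = rng.standard_normal(dim)
        g -= (g @ w) / (w @ w) * w                      # project out the radial part
        g *= grad_scale / (np.linalg.norm(g) * np.linalg.norm(w))
        w_new = w - lr * (g + wd * w)                   # gradient step plus decay
        cos = (w @ w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
        angles[t] = np.arccos(np.clip(cos, -1.0, 1.0))  # rotation in this update
        w = w_new
    return angles, np.linalg.norm(w)

if __name__ == "__main__":
    lr, wd = 0.1, 0.01
    for init_norm in (0.1, 10.0):
        angles, final_norm = simulate_angular_updates(lr, wd, init_norm)
        print(f"init_norm={init_norm}: first-step angle={angles[0]:.4f}, "
              f"late mean angle={angles[-100:].mean():.4f}, "
              f"final norm={final_norm:.4f}")
    print(f"sqrt(2 * lr * wd) = {np.sqrt(2 * lr * wd):.4f}")
```

In this toy model, a vector initialized too small rotates quickly at first and a vector initialized too large rotates slowly, but both settle to the same norm and the same angular update. This is the transient phase followed by balanced rotation that the abstract refers to. Real networks, momentum, and Adam-style optimizers change the details.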

Submission Number: 30
