Why Do We Need Weight Decay for Overparameterized Deep Networks?

Published: 07 Nov 2023 · Last Modified: 13 Dec 2023 · M3L 2023 Poster
Keywords: Weight decay, overparameterization, implicit regularization, optimization dynamics.
Abstract: Weight decay is a widely used technique for training state-of-the-art deep networks. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via loss stabilization.
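For concreteness, the standard coupled weight-decay update the abstract refers to is w ← w − η(∇L(w) + λw), which shrinks the weights a little on every step and so alters the optimization trajectory itself rather than only penalizing the final solution. Below is a minimal, self-contained sketch of this update in plain NumPy; the quadratic toy loss, learning rate, and decay coefficient are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr=0.1, wd=5e-4):
    """One SGD step with coupled weight decay:
    w <- w - lr * (grad + wd * w).
    The wd * w term pulls the weights toward zero at every
    iteration, modifying the optimization dynamics throughout
    training rather than only the endpoint."""
    return w - lr * (grad + wd * w)

# Toy usage: minimize L(w) = 0.5 * ||w - target||^2 with weight decay.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
target = np.ones(5)
for _ in range(200):
    grad = w - target  # gradient of the quadratic toy loss
    w = sgd_weight_decay_step(w, grad)
print(w)  # close to target, slightly shrunk toward zero by the decay term
```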
Submission Number: 89