Neglected Hessian component explains mysteries in sharpness regularization

Yann Dauphin; Atish Agarwala; Hossein Mobahi

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, Venue: NeurIPS 2024 spotlight, License: CC BY-NC 4.0

Keywords: sharpness, flatness, regularization

TL;DR: Understanding the neglected indefinite part of the Hessian explains important phenomena in sharpness regularization

Abstract: Recent work has shown that methods that regularize second-order information, such as SAM, can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We investigate this inconsistency and reveal its connection to the structure of the Hessian of the loss, specifically its decomposition into the positive semi-definite Gauss-Newton matrix and an indefinite matrix, which we call the Nonlinear Modeling Error (NME) matrix. Previous studies have largely overlooked the significance of the NME in their analysis. However, we provide empirical and theoretical evidence that the NME is important to the performance of gradient penalties and explains their sensitivity to activation functions. We also provide evidence that the difference in regularization performance between gradient penalties and weight noise can be explained by the NME. Our findings emphasize the necessity of considering the NME in both experimental design and theoretical analysis for sharpness regularization.
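
For readers unfamiliar with the decomposition mentioned in the abstract, the following is a minimal sketch obtained from the chain rule. The notation is an assumption made here for illustration and may differ from the paper's: the loss is written as $\mathcal{L}(\theta) = \ell(z(\theta))$, where $z(\theta)$ denotes the model outputs and $J$ is their Jacobian with respect to the parameters $\theta$.

```latex
% Assumed notation (not taken from the abstract):
%   \theta : parameters,  z(\theta) : model outputs,
%   \ell   : loss as a function of the outputs,
%   J = \partial z / \partial \theta : output-parameter Jacobian.
\begin{align}
  \nabla_\theta^2 \mathcal{L}
    &= \underbrace{J^\top \left( \nabla_z^2 \ell \right) J}_{\text{Gauss-Newton term}}
     \;+\; \underbrace{\sum_k \frac{\partial \ell}{\partial z_k}\,
           \nabla_\theta^2 z_k}_{\text{NME term}},
  \qquad
  J = \frac{\partial z}{\partial \theta}.
\end{align}
% The first term is positive semi-definite whenever \ell is convex in z
% (for example, squared error, or softmax cross-entropy on logits).
% The second term is indefinite in general; it vanishes when z is linear
% in \theta, or when the output gradient \partial \ell / \partial z is zero.
```

Under these assumptions, the second term is the only place where second derivatives of the model with respect to its parameters enter, which is consistent with the abstract's point that the NME carries the sensitivity to the choice of activation function.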

Primary Area: Other

Submission Number: 21353
