Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

TMLR Paper2652 Authors

08 May 2024 (modified: 16 Aug 2024) · Decision pending for TMLR · CC BY-SA 4.0
Abstract: The training of overparameterized neural networks has received considerable attention in the recent literature. Because the loss landscapes of such networks are highly nonconvex and nonlinear, their regularization is an important consideration. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss and steer training toward regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, a naive implementation of noise injection, such as adding noise to the weight matrices before backpropagation, yields limited empirical improvement. To address this limitation, we design a two-point noise injection scheme, which injects noise into the weight matrices along both the positive and negative directions of the random noise. This two-point scheme cancels out the first-order Taylor expansion terms in the estimate of the Hessian trace. We show that this regularization improves generalization by proving a PAC-Bayes bound that depends on the trace of the Hessian and the radius of the fine-tuning region. Extensive experiments validate that our approach effectively regularizes the Hessian and improves generalization. First, our algorithm outperforms prior approaches to sharpness-reducing training, showing up to a 2.4% increase in test accuracy when fine-tuning pretrained ResNets on six image classification datasets, while reducing the trace of the Hessian by 15.8% and its largest eigenvalue by 9.7%. Second, the noise injection algorithm can be combined with alternative regularization methods such as weight decay and data augmentation. Third, we show that our approach improves generalization in pretraining CLIP models and in chain-of-thought fine-tuning. Lastly, we analyze the convergence of our algorithm. Our analysis builds on a connection between minimizing noise-injected functions and stochastic optimization, leading to sharp convergence rates for the above noise injection algorithm.
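To make the two-point scheme concrete, the following is a minimal PyTorch sketch of one training step in the style the abstract describes: sample isotropic Gaussian noise, evaluate gradients at the perturbed points w + u and w - u, and average them so that first-order terms cancel. The function name `two_point_noise_step`, the `loss_fn(model, batch)` interface, and the noise scale `sigma` are illustrative assumptions, not the authors' released code.

```python
import torch

def two_point_noise_step(model, loss_fn, batch, optimizer, sigma=0.01):
    """One step of a two-point noise-injection scheme (illustrative sketch).

    Averaging the gradients at w + u and w - u, with u ~ N(0, sigma^2 I),
    cancels the first-order Taylor terms; in expectation, the objective being
    minimized gains a Hessian-trace penalty of order sigma^2.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    # Sample one isotropic Gaussian perturbation per parameter tensor.
    noise = [sigma * torch.randn_like(p) for p in params]

    optimizer.zero_grad()
    with torch.no_grad():
        for p, u in zip(params, noise):
            p.add_(u)                   # move weights to w + u
    loss_fn(model, batch).backward()    # accumulate gradient at w + u

    with torch.no_grad():
        for p, u in zip(params, noise):
            p.sub_(2 * u)               # move weights to w - u
    loss_fn(model, batch).backward()    # accumulate gradient at w - u

    with torch.no_grad():
        for p, u in zip(params, noise):
            p.add_(u)                   # restore the original weights w
        for p in params:
            if p.grad is not None:
                p.grad.div_(2)          # average the two accumulated gradients
    optimizer.step()
```

Under these assumptions, the step drops into a standard training loop in place of the usual forward/backward/step sequence; the noise is resampled at every iteration.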
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their feedback on our submission. Compared with the last submission, we have made the following changes:
- Following Reviewer Z2QG's suggestion, we significantly restructured the manuscript, reducing the mathematical notation in the abstract and in Section 2. We also rewrote the abstract and the introduction, and reduced redundancies in the related work section.
- Following Reviewer wuA1's comments, we added a new result on the trace of the Hessian.
- Following Reviewer X1ow's comments, we included new experimental results on pretraining. We also added an experiment on chain-of-thought fine-tuning, based on Reviewer v3GM's comments, and added various experiments to check the robustness of our comparison to SAM, as suggested by Reviewer X1ow.
- Revised the figure captions and added Figure 4 (replacing the original table of results).
- Made numerous other minor revisions based on the reviewers' comments.
Assigned Action Editor: ~Yair_Carmon1
Submission Number: 2652