Flat is the New Sharp: Flatness-Aware Regularization for Robust Learning

ICLR 2026 Conference Submission21421 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0

Keywords: Flat minima, Curvature penalty, Hessian trace, Hutchinson’s estimator, Deep Neural Networks.

TL;DR: Flatness-Aware Regularization (FA-Regularization) is a new training approach that adds a Hessian-based penalty to the loss function, explicitly pushing the optimizer toward flatter regions of the loss landscape, which leads to better generalization.

Abstract: Understanding and improving the generalization of neural networks has been a central focus in machine learning. One of the most significant efforts to address this challenge revolves around the loss landscape of deep neural networks (DNNs). While some researchers have posited that solutions located in flatter regions of the loss surface tend to generalize better than those in sharper regions, others have provided theoretical frameworks and empirical findings suggesting that flat minima are not the sole, or even primary, reason for strong generalization. Despite these advances, the relationship between loss landscape geometry and generalization remains an open question. In this work, we contribute to this question by introducing Flatness-Aware Regularization (FA-Regularization). This method explicitly steers training toward flatter minima by adding an estimate of the trace of the squared Hessian to the training loss. We present empirical results demonstrating that this Hessian estimate effectively penalizes the curvature of the loss surface, enabling the optimizer to converge to flatter regions. We tested our FA-Regularizer across a variety of models (MLP and logistic regression) and datasets (CIFAR-100, IMDB Movie Reviews, and Breast Cancer Wisconsin). FA-Regularization consistently improves generalization on CIFAR-100 compared to a baseline loss function without the penalty term. Our results indicate that flatness correlates with, but does not fully explain, generalization.
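
The abstract does not spell out the exact form of the penalty, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses the standard identity behind Hutchinson's estimator: for a symmetric Hessian H and random probes v with E[v vᵀ] = I, tr(H²) = E[‖Hv‖²]. The function names, the weight `lam`, the Rademacher probes, and the finite-difference Hessian-vector product are all illustrative choices, not taken from the submission; a practical implementation would more likely obtain Hv from automatic differentiation.

```python
import random


def hvp_central_diff(grad, theta, v, eps=1e-3):
    """Approximate H(theta) @ v by a central difference of the gradient.

    Assumption for illustration: a real implementation would likely use
    automatic differentiation instead of finite differences.
    """
    theta_plus = [t + eps * x for t, x in zip(theta, v)]
    theta_minus = [t - eps * x for t, x in zip(theta, v)]
    g_plus, g_minus = grad(theta_plus), grad(theta_minus)
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]


def hutchinson_trace_h2(grad, theta, n_probes=2000, seed=0):
    """Hutchinson-style estimate of tr(H^2) as the mean of ||H v||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in theta]
        hv = hvp_central_diff(grad, theta, v)
        total += sum(h * h for h in hv)
    return total / n_probes


def fa_regularized_loss(loss, grad, theta, lam=0.01, **kwargs):
    """Base loss plus lam * estimated tr(H^2); lam is a hypothetical weight."""
    return loss(theta) + lam * hutchinson_trace_h2(grad, theta, **kwargs)


# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, whose Hessian is A.
A = [[2.0, 1.0], [1.0, 3.0]]


def toy_grad(theta):
    # Gradient of the quadratic: A @ theta.
    return [sum(a * t for a, t in zip(row, theta)) for row in A]


def toy_loss(theta):
    # 0.5 * theta^T (A @ theta)
    return 0.5 * sum(t * g for t, g in zip(theta, toy_grad(theta)))


if __name__ == "__main__":
    theta0 = [1.0, 1.0]
    # The exact value is tr(A^2) = 15; the estimate is noisy around it.
    print(hutchinson_trace_h2(toy_grad, theta0))
    print(fa_regularized_loss(toy_loss, toy_grad, theta0))
```

On the toy quadratic the Hessian is constant, so the penalty does not depend on the parameters; in a neural network it varies with the weights, which is what lets such a term favor lower-curvature regions during training.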

Supplementary Material: pdf

Primary Area: optimization

Submission Number: 21421
