Keywords: generalization, training algorithm, Sharpness-Aware Minimization, machine learning
TL;DR: We rethink flat minima in modern over-parameterized networks prone to overfitting, suggesting $\epsilon$-Maxima as a more suitable alternative and demonstrating promising generalization across different tasks.
Abstract: Modern deep neural networks are often over-parameterized, leading to significant overfitting: they can achieve near-zero training loss while generalizing poorly. In response, seeking flat minima, for example via Sharpness-Aware Minimization (SAM), has become a widely adopted strategy for improving generalization, based on the heuristic assumption that model parameters located in low-curvature regions of the training loss landscape will induce similarly low loss values over the underlying data distribution. However, given the inscrutable geometric structure of the loss landscape over the real data distribution, flat minima may not be the only optimal solution. We ask whether an alternative geometric structure of the training loss landscape could offer better generalization over the underlying data distribution. To formalize this, we propose to seek an $\epsilon$-maxima point: one that achieves a loss value at least $\epsilon$ greater than that of every point within a punctured perturbation domain of a given radius. We demonstrate that seeking such a point with our novel optimization framework, $\epsilon$-MS, surpasses both SAM and SAM-based methods on standard generalization benchmarks. Moreover, in stronger generalization scenarios, including long-tailed recognition and single-domain generalization, $\epsilon$-MS exhibits clear advantages. In particular, it achieves state-of-the-art performance on standard generalization benchmarks and long-tailed recognition tasks, highlighting its promising generalization across diverse training scenarios.
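To make the contrast with SAM concrete, the following is a minimal, hypothetical sketch of a SAM-style two-step update on a toy quadratic loss. SAM perturbs parameters uphill before computing the update gradient (biasing toward flat minima); flipping the perturbation sign illustrates, in spirit only, how an update could instead bias optimization toward points that sit above their punctured neighborhood, as the $\epsilon$-maxima definition describes. The `sign` parameter, the toy loss, and all hyperparameter values are illustrative assumptions; the actual $\epsilon$-MS update rule is not specified in the abstract.

```python
import numpy as np

def loss(w):
    # Toy quadratic training loss, purely for illustration.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Gradient of the toy quadratic loss.
    return w

def perturbed_step(w, lr=0.1, rho=0.05, sign=+1.0):
    """One SAM-style two-step update (hypothetical sketch).

    sign=+1.0 perturbs uphill, as in SAM's flat-minima objective.
    sign=-1.0 perturbs downhill, an assumed illustration of steering
    toward points higher than their perturbation neighborhood; this is
    NOT the actual eps-MS algorithm from the paper.
    """
    g = grad(w)
    # Normalized perturbation of radius rho around the current point.
    eps = sign * rho * g / (np.linalg.norm(g) + 1e-12)
    # Update using the gradient evaluated at the perturbed point.
    g_adv = grad(w + eps)
    return w - lr * g_adv
```

On this convex toy loss both signs still drive the training loss down; the sign only changes which gradient within the perturbation ball steers the update, which is where the geometric bias of each objective enters.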
Primary Area: optimization
Submission Number: 6923