Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation

ICLR 2026 Conference Submission 15100 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Sharpness-Aware Minimization, Optimization, Generalization
TL;DR: Based on empirical and theoretical analysis, we propose a novel interpretation of a key component of Sharpness-Aware Minimization (SAM) and introduce XSAM to address two limitations revealed by this analysis.
Abstract: Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the current parameters. In practice, however, this objective is approximated by one or more gradient ascent steps, followed by applying the gradient evaluated at the ascent point to update the current parameters. Although this practice is justified as approximately optimizing the objective while neglecting the (full) derivative of the ascent point with respect to the current parameters, a direct and intuitive understanding of why the gradient at the ascent point, applied to the current parameters, performs so well despite the shift in evaluation location is still lacking. Our work bridges this gap by proposing and justifying a novel, intuitive interpretation: the gradient at the single-step ascent point, when applied to the current parameters, approximates the direction from the current parameters towards the maximum within the local neighborhood better than the local gradient does, thereby enabling a more direct escape from that maximum. Nevertheless, our analysis further reveals that: i) the approximation provided by the gradient at the single-step ascent point is often inaccurate; and ii) the approximation quality may degrade as the number of ascent steps increases, which explains the unexpectedly inferior performance of multi-step SAM. To address these limitations, we propose eXplicit Sharpness-Aware Minimization (XSAM). XSAM addresses the first limitation by explicitly estimating the direction towards the maximum during training and then updating the parameters along the opposite direction, and the second by crafting a search space that can effectively leverage the information provided by the gradient at the multi-step ascent point. XSAM features a unified formulation that applies to both single-step and multi-step settings and incurs only negligible additional computational overhead. Extensive experiments demonstrate the consistent superiority of XSAM over existing counterparts across various models, datasets, and settings.
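For concreteness, the following is a minimal sketch of the standard single-step SAM update described in the abstract (ascend to the neighborhood boundary, then apply the gradient at the ascent point to the current parameters). The toy loss, neighborhood radius `rho`, and learning rate are illustrative choices and are not taken from the paper; XSAM's explicit direction estimation is not shown here.

```python
# Minimal sketch of the single-step SAM update (not the paper's XSAM method):
# take a normalized gradient ascent step of radius rho, then apply the gradient
# evaluated at the ascent point back at the *current* parameters.
import numpy as np

def loss(w):
    # Toy objective standing in for the training loss.
    return 0.5 * np.dot(w, w) + np.sin(w).sum()

def grad(w):
    # Analytic gradient of the toy objective.
    return w + np.cos(w)

def sam_step(w, rho=0.05, lr=0.1):
    g = grad(w)                                   # local gradient at the current parameters
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # single ascent step to the neighborhood boundary
    g_ascent = grad(w + eps)                      # gradient at the ascent point
    return w - lr * g_ascent                      # use it to update the current parameters

w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = sam_step(w)
print("final loss:", loss(w))
```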
Supplementary Material: zip
Primary Area: optimization
Submission Number: 15100