
\section{Introduction} \label{sec:intro}

Optimization algorithms, though primarily designed to minimize training loss, have increasingly been recognized for their role in implicitly regularizing machine learning models, with some optimizers leading to stronger generalization than others \citep{keskar2017on, wilson2017marginal, ji2020gradient, andriushchenko2023sgd}.
This has motivated extensive efforts to uncover the underlying mechanisms and incorporate these insights into the design of more effective optimizers \citep{izmailov2018averaging,foret2020sharpness,orvieto2022anticorrelated, zhao2022penalizing}.

One prominent line of research examines the relationship between the sharpness of the loss landscape and generalization error, with flatter minima generally associated with improved generalization performance \citep{hochreiter1997flat, keskar2017on, jiang2019fantastic}.
This observation has motivated the development of optimization techniques aimed at encouraging convergence to such flat regions \citep{izmailov2018averaging, chaudhari2017entropysgd, foret2020sharpness, orvieto2022anticorrelated, zhao2022penalizing}.
Notably, SAM \citep{foret2020sharpness} has drawn significant interest for its ability to promote flatter minima and enhance generalization beyond what is typically achieved with standard optimizers \citep{bahri2022sharpness, chenvision, qu2022generalized}.
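For concreteness, the SAM objective and update can be sketched as below, in the form commonly attributed to \citet{foret2020sharpness}; the symbols $L$ (training loss), $w$ (parameters), $\rho$ (perturbation radius), and $\eta$ (step size) are illustrative notation assumed here and may differ from the notation used in later sections.
\[
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
w_{t+1} = w_t - \eta \, \nabla L\!\left( w_t + \rho \, \frac{\nabla L(w_t)}{\|\nabla L(w_t)\|_2} \right).
\]
The update on the right is the usual first-order approximation of the inner maximization, so each step follows the gradient evaluated at a perturbed point rather than at $w_t$ itself.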

However, the effectiveness of SAM rests on an implicit assumption: that the loss landscape provides sufficient variability in flatness for SAM to exploit.
Recent perspectives suggest that overparameterization may be precisely what gives rise to such conditions, as it enlarges the solution space and potentially enables solutions with greater variation in local geometry, such as sharpness \citep{ma2023recent}.
If so, overparameterization might not be merely optional but essential: without it, SAM might fail to produce similar benefits.

This line of reasoning motivates us to examine more closely the effects of overparameterization on sharpness-aware minimization (SAM) \citep{foret2020sharpness}, with an eye toward understanding not just whether SAM is effective but also under what conditions and why.
Specifically, we conduct extensive experiments to measure the impact of overparameterization across a diverse set of tasks, ranging from standard tasks in computer vision and natural language processing to molecular property prediction and video games in reinforcement learning.
To gain further insight into the results, we investigate the interactions between overparameterization and SAM in detail, both by visually inspecting the solution space in a simple regression setting and by analyzing the influence of overparameterization on the implicit bias of SAM.
Furthermore, we study how overparameterization influences SAM under various conditions, including label noise, sparsity, and regularization.
Finally, we explore other implications of overparameterization for SAM through theoretical analyses, including the characteristics of the attained minima and the convergence rate.


Our key contributions and findings are summarized as follows.


\begin{itemize} [leftmargin=*]
    \item \textbf{\cref{sec:experiments-main}.} \hspace{.1em}
    We perform extensive experiments across eight workloads of datasets and models at varying scales, spanning synthetic, vision, language, chemistry, and game domains.
    We observe that overparameterization consistently improves the generalization benefit of SAM\footnote{By ``generalization benefit'', we mean the improvement made by SAM over SGD in validation accuracy.}.
    This phenomenon is general and previously unknown\footnote{While evidence of a similar observation can be found in the literature \citep{chenvision}, no prior work has conducted experiments or confirmed this phenomenon at any scale comparable to ours.}.
    \item \textbf{\cref{sec:understanding}.}\hspace{.1em}
    We propose hypotheses to explain this general phenomenon, positing that two factors may be at play:
    (i) overparameterization increases the number of simpler and flatter solution candidates, and
    (ii) it also strengthens the implicit bias of SAM.
    We verify both hypotheses with standard experiments in synthetic and realistic settings.
    \item \textbf{\cref{sec:practical}.} \hspace{.1em}
    We present the merits and caveats of overparameterization when employing SAM in practice: (i) the benefit of overparameterization for SAM is more pronounced under label noise and sparsity, while (ii) sufficient regularization is needed.
    These findings can serve as useful guidance for practitioners.
    \item \textbf{\cref{sec:theory}.}\hspace{.1em}
    We develop theoretical analyses\footnote{We note that these are not intended to directly support \cref{sec:experiments-main,sec:understanding}, which we discuss in \cref{sec:discussion}.} of linear stability and convergence:
    under overparameterization, (i) linearly stable minima for SAM are flatter and have more uniformly distributed Hessian moments than those for SGD, and (ii) stochastic SAM can converge at a linear rate.
    These results are also verified numerically.
    \item \textbf{Overall.}\hspace{.8em}
    We discover that overparameterization has a \emph{critical} influence on SAM.
    Both the empirical performance and the theoretical properties of SAM improve with overparameterization.
    In other words, SAM may not realize its advantage over SGD without overparameterization.
\end{itemize}  
