\vspace{-0.5em}
\section{Conclusion}

\label{sec:discussion}

\vspace{-0.5em}

In this work, we demonstrated the \textit{critical influence of overparameterization on SAM} from both empirical and theoretical perspectives.
We began with an extensive evaluation that revealed a highly consistent trend: the generalization benefit of SAM increases with overparameterization, without which SAM may not be effective (\cref{sec:experiments-main}).
This led us to hypothesize that the benefit can be explained in terms of increased solution space and implicit bias (\cref{sec:understanding}).
In addition, we presented further merits and caveats of overparameterization in practice (\cref{sec:practical}).
Finally, we established theoretical advantages of overparameterization for SAM in terms of linear stability, convergence, and generalization (\cref{sec:theory}).
We believe these findings help bridge overparameterization and SAM, a connection that has so far received little attention in the literature.
Below, we discuss limitations, directions for future work, and practical implications of our results.


\paragraph{Theoretical account of \cref{sec:experiments-main}}
The consistent trend observed in \cref{sec:experiments-main} hints at a fundamental underlying mechanism, yet our study does not offer a precise theory to support this phenomenon.
This is largely because modeling the generalization of SAM under varying degrees of overparameterization pushes the boundaries of existing theoretical frameworks.
Nevertheless, drawing upon recent advancements in understanding overparameterization and generalization, we have developed plausible hypotheses to directly address this phenomenon (\cref{sec:understanding}). 
We also employed rigorous theoretical frameworks to examine the effects of overparameterization on various other aspects of SAM, reinforcing the general trend of overparameterization benefits (\cref{sec:theory}).
We believe these efforts offer valuable insights and preliminary foundations that could be instrumental in achieving a comprehensive theoretical account of \cref{sec:experiments-main} in the future.


\paragraph{Other sharpness minimization schemes}
Our theoretical results in \cref{sec:theory} are based on an unnormalized version of SAM.
This choice is driven by two reasons:
(i) it appears to make little practical difference compared to the original SAM, and more crucially,
(ii) it simplifies the analysis, as is common in early studies \citep{andriushchenko2022towards, compagnoni2023sde}.
However, more recently, works such as \citet{dai2023crucial, si2023practical} have highlighted the theoretical significance of the normalization step.
We plan to extend our analysis to better reflect the effect of normalization in future work.
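For concreteness, the two variants can be sketched as follows; the notation here ($x_t$ for the parameters, $f$ for the training loss, $\eta$ for the step size, and $\rho$ for the perturbation radius) is assumed for illustration and may differ from that used in \cref{sec:theory}.
The original SAM update takes the form
\[
x_{t+1} = x_t - \eta \, \nabla f\!\left(x_t + \rho \, \frac{\nabla f(x_t)}{\|\nabla f(x_t)\|}\right),
\]
whereas the unnormalized variant drops the division by the gradient norm:
\[
x_{t+1} = x_t - \eta \, \nabla f\!\left(x_t + \rho \, \nabla f(x_t)\right).
\]
In the stochastic setting, the gradients would be evaluated on a mini-batch; the full-batch form is shown here only for brevity.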
Additionally, different sharpness minimization schemes can lead to different minima and resulting performance \citep{kaddour2022flat, dauphin2024neglected}.
Extending our analyses to other non-SAM sharpness minimization schemes \citep{izmailov2018averaging,orvieto2022anticorrelated} and studying how they compare to SAM under overparameterization would therefore be a promising avenue for future work.
Nonetheless, we consider these results an initial exploration of the impact of overparameterization on SAM, setting the stage for future research.



\paragraph{Additional ablation studies}

In addition to label noise, sparsity, and regularization from \cref{sec:practical}, we investigate the influence of other factors on the increased benefit of SAM in \cref{app:ablation}.
Specifically, in \cref{app:depth}, we explore the effect of increasing the depth instead of the width, where we find that the advantages differ across architectures, with MLPs appearing to benefit more significantly than ResNets.
We suspect that this results from the complex interplay of the many factors and design decisions involved in increasing depth in modern architectures.
Also, in \cref{app:exp-linear}, inspired by recent studies suggesting that overparameterized models can behave like linearized models \citep{jacot2018neural, chizat2019lazy}, we test whether the increased benefit of SAM is due to linearization.
We observe that SAM underperforms SGD in the linearized regimes by more than $10\%$.
This indicates that linearization is not the main factor behind the increased benefit of SAM and further supports the view that overparameterization itself is likely the main factor.


\paragraph{Potential for modern deep learning}

Our key observations in \cref{sec:experiments-main} indicate great potential for using SAM in the modern landscape of large-scale training \citep{kaplan2020scaling,belkin2021fit}.
Our results in \cref{sec:practical} further highlight this potential given the current trend in which foundation models are often trained on noisy data \citep{radford2021learning,schuhmann2022laion} or employ sparsity \citep{frantar2023scaling,jiang2024mixtral}.
In this regard, we anticipate that the overparameterization benefit may hold even when training billion-scale foundation models \citep{zhang2022opt,dehghani2023scaling}, which we leave for future work.
It would also be interesting to study how popular settings for training foundation models other than label noise or sparsity affect the benefit, such as quantization \citep{gholami2022survey}, dataset pruning \citep{agiollo2024approximating}, differential privacy \citep{yu2021differentially}, or human alignment \citep{ouyang2022training}.


