
\begin{figure*}[!t]
  \centering
  \hspace{-2em}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_5_SGD.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_10_SGD.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_100_SGD.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_1000_SGD.pdf}\\
  \hspace{-2em}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_5_SAM.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_10_SAM.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_100_SAM.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/function/batch_size6_numneurons_1000_SAM.pdf}
  \caption{
  Solutions found by SGD (top) and SAM (bottom). 
  Both optimizers find similar solutions for under/moderately-parameterized models, whereas for overparameterized models the solutions found by SAM are much simpler and vary less across seeds than those found by SGD.
  Here, different colors correspond to different random seeds.
  }
  \label{fig:onehidden-function}
\end{figure*}


\section{Understanding why SAM improves with overparameterization}
\label{sec:understanding}

Why, then, does overparameterization particularly favor SAM over non-sharpness-aware optimizers?
We address this question in this section to better understand the effect of overparameterization on SAM.
Specifically, we posit that it is due to the complementarity between overparameterization enlarging the solution space and the implicit bias of SAM driving it toward flat minima;
\ie, once overparameterization makes more diverse solutions available (including both sharp and flat minima), optimizers intrinsically biased toward flat solutions (such as SAM) are more likely to find such solutions than unbiased optimizers (such as SGD).
We support this hypothesis by demonstrating the following:
(i) SAM finds simpler and flatter solutions than SGD with the enlarged solution space (\cref{sec:understanding-solution}), and 
(ii) the implicit bias of SAM becomes stronger with overparameterization (\cref{sec:understanding-implicitbias}); 
both of these take place only when the model is overparameterized.

\subsection{Enlarged solution space allows SAM to find simpler and flatter solutions}
\label{sec:understanding-solution}






To corroborate our hypothesis, we start with a simple experiment where we train one-hidden-layer ReLU networks using SAM and SGD following \citet{andriushchenko2022towards};
we use $5$, $10$, $100$, and $1000$ hidden neurons for underparameterized to highly overparameterized cases;
we run three random seeds and compare solutions obtained by SAM and SGD in \cref{fig:onehidden-function}.

First, we find that the solutions found by SAM differ little from those found by SGD when the model has no more than $10$ neurons.
Looking closely at the case of $10$ neurons, the solutions all appear to be piecewise linear functions with roughly $4$ to $6$ segments, \ie, the number of line segments in each solution is below $10$, the maximum number of joints that this model can have in theory.
On the other hand, with $100$ and $1000$ neurons, one can easily see that the solutions found by SAM are much simpler (and thus more likely to generalize) than those found by SGD.


Next, we also track the optimization trajectories of both SAM and SGD.
The trajectories are plotted along PCA directions calculated from the converged minima following \citet{li2018visualizing}.
The results are illustrated in \cref{fig:onehidden-trajectories}.
We find that both SAM and SGD reach solutions in a similar basin when the model is under/moderately parameterized, whereas in the overparameterized case, they reach different solutions, \ie, SAM reaches a flatter solution, even though they all start from the same initial point.

These results support the idea that SAM has an implicit bias that drives it toward a certain type of solution (\eg, simple and flat), as shown in prior work \citep{andriushchenko2022towards,compagnoni2023sde,wensharpness}.
More importantly, however, these results newly reveal that \emph{overparameterization is a critical factor in facilitating this implicit behavior of SAM};
without it, the space of potential solutions shrinks, and the implicit bias of SAM may not take effect.


\begin{figure*}[!t]
  \centering
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/contour/numneurons5_sgd2_sam1_anchor0.0_10.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/contour/numneurons10_sgd2_sam1_anchor0.0_10.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/contour/numneurons100_sgd2_sam1_anchor0.0_10.pdf}
  \includegraphics[width=0.24\linewidth]{figures/onehidden_relu/contour/numneurons1000_sgd2_sam1_anchor0.0_10.pdf}
  \caption{
  Optimization trajectories of SGD and SAM starting from the same initial point.
  SGD and SAM reach solutions in a similar basin for under/moderately-parameterized models, whereas they reach different solutions for overparameterized models, \ie, SAM reaches a flatter region.
  }
  \label{fig:onehidden-trajectories}
\end{figure*}


\begin{figure*}[!t]
    \centering
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/filter_75,_25__val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/filter_150,_50__val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/filter_300,_100__val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/filter_1200,_400__val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/filter_3000,_1000__val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/mnist/rho/best_rho.pdf}
    
    \includegraphics[width=0.16\linewidth]{figures/cifar/ResNet18/sam_rho/filter4_val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/cifar/ResNet18/sam_rho/filter16_val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/cifar/ResNet18/sam_rho/filter32_val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/cifar/ResNet18/sam_rho/filter64_val_acc.pdf}
    \includegraphics[width=0.16\linewidth]{figures/cifar/ResNet18/sam_rho/filter256_val_acc.pdf}
    \includegraphics[width=0.162\linewidth]{figures/cifar/ResNet18/sam_rho/best_rho.pdf}
  \caption{
    Validation accuracy versus $\rho$ for a $3$-layer MLP trained on MNIST (top) and ResNet-18 trained on CIFAR-10 (bottom).
    The best-performing value $\rho^\star$ tends to be larger for models with more parameters.
  }
  \label{fig:result_overparam_each_rho}
\end{figure*}

\subsection{Implicit bias of SAM increases with overparameterization}
\label{sec:understanding-implicitbias}


While overparameterization can secure favorable conditions for SAM, this alone does not guarantee that the implicit bias of SAM takes effect.
In fact, we can further relate the implicit bias of SAM to the perturbation bound $\rho$ to bridge this gap.
Specifically, SAM can be interpreted as SGD on an implicitly regularized loss based on SDE (stochastic differential equation) modeling \citep{compagnoni2023sde}:
\begin{equation}
    \tilde{f}(x) \coloneqq f(x) + \rho \E\|\nabla f_\gamma (x)\|_2,
\end{equation}
where $\gamma$ denotes the source of stochasticity (\eg, the sampled mini-batch).
This indicates that SAM becomes more regularized (\ie, the implicit bias is amplified) when $\rho$ increases.\footnote{This holds as long as $\rho$ is not too large; otherwise, the regularizer may overshadow the minimization of $f$ and implicitly bias the optimizer toward stationary points such as saddles and maxima.
Note that SAM reduces to standard SGD when $\rho=0$.
}
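As a heuristic first-order sketch of where a gradient-norm penalty of this form can come from (a standard Taylor-expansion argument assuming $f$ is differentiable with $\nabla f(x) \neq 0$, not the SDE derivation of the cited work), the worst-case loss within an $\ell_2$ ball of radius $\rho$ satisfies
\begin{equation}
    \max_{\|\epsilon\|_2 \leq \rho} f(x+\epsilon) \approx \max_{\|\epsilon\|_2 \leq \rho} \left[ f(x) + \epsilon^\top \nabla f(x) \right] = f(x) + \rho \|\nabla f(x)\|_2,
\end{equation}
where the maximum of the linearized objective is attained at $\epsilon = \rho \nabla f(x) / \|\nabla f(x)\|_2$.
In this approximation, the penalty term scales linearly with $\rho$.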





Our interest thus lies in whether overparameterization has any effect on increasing $\rho$;
if so, it means that overparameterization puts more regularization on SAM.
We verify this by finding the empirically optimal perturbation bound $\rho^\star$ that yields the best generalization performance as we change the degree of overparameterization.
Specifically, we take a standard deep learning task and perform an extensive grid search to find $\rho^\star$.
The result is displayed in \cref{fig:result_overparam_each_rho}.

Indeed, we observe that $\rho^\star$ tends to increase as the number of parameters increases;
\ie, from left to right, the value of $\rho$ that yields the highest accuracy (marked with a green star \textcolor{ForestGreen}{$\star$}) tends to increase.
We confirm that this trend is consistently observed for various other workloads (see \cref{fig:result_overparam_each_rho_mnist,fig:result_overparam_each_rho_cifar,fig:result_overparam_each_rho_imagenet,fig:result_overparam_each_rho_sst} of \cref{app:add-rho} for more results).
This result is encouraging, as it suggests that \emph{the generalization benefit of SAM via implicit regularization can indeed increase with overparameterization}.

Additionally, we can develop a conceptual account of why $\rho^\star$ increases with overparameterization.
First, if we measure the average effect of a perturbation $\epsilon \in\mathbb{R}^d$ of size $\rho$ on individual parameters simply as $\mathbb{E}_k[\epsilon_k^2]=\lVert\epsilon\rVert_2^2/d=\rho^2/d$, we can see that $\mathbb{E}_k[\epsilon_k^2] \rightarrow 0$ as $d\rightarrow\infty$ for fixed $\rho$, which implies that SAM would eventually have almost no effect on each parameter as the model scales unless $\rho$ is also increased.
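For a rough sense of scale (illustrative values chosen only for this example, not taken from our experiments), fix $\rho = 0.1$.
The root-mean-square perturbation per parameter is then
\begin{equation}
    \sqrt{\mathbb{E}_k[\epsilon_k^2]} = \frac{\rho}{\sqrt{d}},
\end{equation}
which gives $10^{-3}$ for $d = 10^4$ but only $10^{-4}$ for $d = 10^6$.
Under this simple per-parameter view, keeping the perturbation of each parameter at a fixed level would require $\rho \propto \sqrt{d}$; \eg, a $100\times$ increase in $d$ would call for a $10\times$ larger $\rho$.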

Also, the Lipschitz bound on the gradients gives $\left\lVert \nabla f\left(x+\epsilon\right) - \nabla f(x)\right\rVert_2 \leq \beta \left\lVert x+\epsilon - x\right\rVert_2 = \beta \rho$ for $\lVert\epsilon\rVert_2 = \rho$, indicating that the SAM gradient becomes more similar to the original gradient as the model gets smoother (\ie, smaller smoothness constant $\beta$) with increasing size, thus requiring a larger perturbation bound to achieve a similar level of perturbation effect.
This holds under the assumption that overparameterization makes the model smoother, which we empirically confirm in \cref{fig:result_empirical_lipschitz}.
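As a simple reading of this bound (a sketch that assumes the bound is reasonably tight; the target level $c > 0$ is a hypothetical quantity introduced only for illustration), suppose we want the worst-case gradient deviation $\beta\rho$ to stay at a fixed level $c$:
\begin{equation}
    \beta \rho = c \quad \Longleftrightarrow \quad \rho = \frac{c}{\beta}.
\end{equation}
Under this reading, halving $\beta$ would require doubling $\rho$ to retain the same worst-case deviation between the SAM gradient and the original gradient.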
