\section{Key observation: SAM improves with overparameterization}
\vspace{-0.5em}

\label{sec:experiments-main}

\begin{table*}[!th]
    \centering
    \resizebox{0.9\linewidth}{!}{
    \begin{tabular}{c c c c c c}
      \toprule
      Workload \# & Domain & Task & Dataset & Architecture & Model \\
      \midrule
      $1$ &  Synthetic         & Regression   & Synthetic & MLP & Two-layer MLP \\ 
      $2$ & Vision         & Image classification   & MNIST & MLP & LeNet-300-100 \\    
      $3$ & Vision  & Image classification & CIFAR-10 & CNN & ResNet-18\\ 
      $4$ & Vision  & Image classification & ImageNet & CNN & ResNet-50\\ 
      $5$ & Language  & PoS tagging & Universal Dependencies & Transformer & Encoder-only Transformer\\ 
      $6$ & Language  & Sentiment classification & SST-2 & RNN & LSTM\\ 
      $7$ & Chemistry  & Graph property prediction & ogbg-molpcba & GNN & GCN\\ 
      $8$ & Game  & Proximal policy optimization & Atari Breakout & CNN & Five-layer CNN\\ 
      \bottomrule
  \end{tabular}
  }
  \caption{
    Summary of evaluation workloads.
    They cover eight different datasets spanning five domains and six tasks at varying scales, and include eight neural network models of five different architecture types.
    For each workload, we test up to ten different models of varying degrees of parameterization.
  }
  \label{tab:workloads}
\end{table*}

\begin{figure*}[!th]
\vspace{-0.4em}
    \centering
  \begin{subfigure}{0.9\linewidth}
      \centering
      \includegraphics[width=0.24\linewidth]{figures/synth/sam-sgd/synth-loss_diff.pdf}
      \includegraphics[width=0.24\linewidth]{figures/mnist/overparam_diff/overparamwd[0.0001].pdf}
      \includegraphics[width=0.24\linewidth]{figures/cifar/ResNet18/overparam_diff/overparamwd[0.0005].pdf}
      \includegraphics[width=0.24\linewidth]{figures/imagenet/overparam_diff/overparam.pdf}
      \includegraphics[width=0.24\linewidth]{figures/pos/overparam_diff/overparam_bestvalacc.pdf}
      \includegraphics[width=0.24\linewidth]{figures/sst/overparam_diff/overparam_bestvalacc.pdf}
      \includegraphics[width=0.24\linewidth]{figures/gnn/overparam_diff/overparam.pdf}
      \includegraphics[width=0.24\linewidth]{figures/ppo/overparam_diff/overparam_bestvalacc.pdf}
  \end{subfigure}
    \vspace{-0.4em}
  \caption{
    Improvement in validation metrics by SAM.
    The generalization benefit of SAM tends to increase as the model becomes more overparameterized.
    We present the full results including the absolute metrics for SAM and baseline optimizers in \cref{fig:result_overparam_acc_app} of \cref{app:add-overparam_results}.
  }
  \label{fig:result_overparam_acc}
    \vspace{-0.7em}
\end{figure*}

SAM was introduced to find flat minima and thereby improve generalization performance in practice.
In this work, we are interested in whether and how this improvement is affected by overparameterization.
In order to understand any potential relationship between SAM and overparameterization, we first focus on precisely measuring the effect of overparameterization.
More specifically, we conduct a wide range of deep learning experiments (see \cref{tab:workloads} for a summary of all tested workloads) and observe how the generalization improvement made by SAM changes as the number of parameters grows.
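As a rough formalization of this quantity (the notation here is ours for illustration and assumes the improvement is a plain difference of validation metrics), let $m_{\mathrm{SAM}}(p)$ and $m_{\mathrm{base}}(p)$ denote the validation metric reached by SAM and by the baseline optimizer on a model with $p$ parameters, where higher is better.
The improvement can then be written as
\[
  \Delta(p) = m_{\mathrm{SAM}}(p) - m_{\mathrm{base}}(p),
\]
with the sign flipped for metrics where lower is better, such as a regression loss, so that $\Delta(p) > 0$ always means that SAM helps; the question is then how $\Delta(p)$ behaves as $p$ increases.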

As a result, we find a strong and consistent trend that SAM improves with overparameterization in all tested cases (see \cref{fig:result_overparam_acc}).
To elaborate, SAM initially does \emph{not} work much better than the non-sharpness-aware baseline optimizer (\ie, SGD or Adam family depending on the default choice) when the model has a relatively small number of parameters;
it only starts to improve with more parameters and clearly outperforms the baseline at a very large number of parameters.
We emphasize that this holds true for a wide variety of architectures (MLP, CNN, RNN, GNN, Transformer) and datasets from different domains (Synthetic, Vision, Language, Chemistry, Game) under a rigorous hyperparameter search (see \cref{app:exp_details} for the full experiment details).

This result suggests that SAM is more effective when (and possibly only when) applied to overparameterized models.
At the same time, the growing generalization benefit of SAM with more parameters is promising, given that modern neural network models are often heavily overparameterized \citep{zhang2022opt,dehghani2023scaling}.
We note that some evidence of a similar positive influence of overparameterization on SAM can be found in the literature \citep{chenvision}; however, no prior work has conducted experiments or confirmed this phenomenon at a scale comparable to ours.\footnote{
As an additional result, we provide a theoretical analysis of how overparameterization decreases the test error of SAM in \cref{sec:theory-gen-main,app:sam-genbound}.
Strictly speaking, however, this result concerns SAM alone and is not to be confused with the relative improvement over SGD shown in \cref{sec:experiments-main}.
}

