\section{Further merits and caveats of overparameterization}
\label{sec:practical}


\begin{figure*}[!t]
      \centering
      \begin{subfigure}{0.19\linewidth}
          \includegraphics[width=\linewidth]{figures/labelnoise/Noise rate.pdf}
          \caption{Label noise}
          \label{fig:resut_labelnoise_noiserate}
      \end{subfigure}
      \hspace{1em}
      \begin{subfigure}{0.199\linewidth}
      \includegraphics[width=\linewidth]{figures/cifar/ResNet18/largesparse_diff/largesparse_snip_dense16.pdf}
          \caption{Sparsity}
          \label{fig:result_largesparse_cifar_resnet_dense16}
      \end{subfigure}
      \hspace{1em}
      \begin{subfigure}{0.19\linewidth}
          \includegraphics[width=\linewidth]{figures/cifar/ResNet18/overparam_diff/overparamwd[0.0].pdf}
            \caption{w/o weight decay}
            \label{fig:result_overparam_diff_resnet_wo_wd}
      \end{subfigure}
      \begin{subfigure}{0.164\linewidth}
          \includegraphics[width=\linewidth,trim={1em 0 1em 0},clip]{figures/pos/overparam_diff/overparam.pdf}
            \caption{w/o early stop.}
            \label{fig:result_overparam_diff_pos_wo_es}
      \end{subfigure}
      \begin{subfigure}{0.167\linewidth}
          \includegraphics[width=\linewidth,trim={1.3em 0 0 0},clip]{figures/cifar/ViT/overparam_diff/overparam.pdf}
            \caption{w/o induc. bias}
            \label{fig:result_overparam_diff_cifar_vit}
      \end{subfigure}
      \caption{
        Effect of (a) label noise, (b) sparsity, and (c--e) regularization on SAM.
        (a) The benefit of SAM is more pronounced at higher noise levels.
        (b) The improvement from SAM tends to be larger in large sparse models than in their small dense counterparts.
        (c--e) Without sufficient regularization, SAM does not always benefit from overparameterization.
        See \cref{fig:result_labelnoise,fig:result_largesparse_cifar,fig:result_largesparse_mnist,fig:result_largesparse_rho_mnist,fig:result_overparam_overfit} in \cref{app:add_practical} for more results.
      }
\end{figure*}

In this section, we present further merits and some caveats of overparameterization.
Specifically, we show that the overparameterization benefit of SAM persists and becomes more evident under label noise or sparsity.
We also find that sufficient regularization is required to attain this benefit.
These results could serve as guidance for employing SAM in practice.


\vspace{-0.5em}
\paragraph{Overparameterization secures the robustness of SAM to label noise.}
\label{sec:experiments-main-label_noise}



In practice, deep learning models are often trained on noisy data \citep{song2022learning}.
To examine whether the overparameterization benefit for SAM persists in this scenario, we inject label noise into the training data \citep{angluin1988learning,natarajan2013learning} and examine how SAM responds.
The results are reported in \cref{fig:resut_labelnoise_noiserate}.
Overall, we find that SAM benefits from overparameterization significantly more than SGD in the presence of label noise.
More precisely, the accuracy improvement of SAM over SGD keeps increasing as the model has more parameters, whereas it is marginal for less parameterized models.
This trend is more pronounced at higher noise levels;
\eg, the improvement rises from $5\%$ to nearly $50\%$ at the highest noise rate.
Notably, SAM was previously known to be more robust to label noise than SGD \citep{foret2020sharpness,baek2024sam,huang2023robust,zou2024towards,kim2023fantastic}, yet this result newly reveals that overparameterization plays a profound role in securing this robustness.

\vspace{-0.5em}
\paragraph{SAM benefits from sparse overparameterization.}
\label{sec:experiments-main-large_sparse}


There has been recent interest in employing sparsity when training large models to alleviate computation and memory costs \citep{hoefler2021sparsity,mishra2021accelerating}.
To test the effect of overparameterization on SAM in this setting, we introduce varying degrees of sparsity into an overparameterized model at initialization \citep{leesnip} such that its number of remaining parameters matches that of the original dense model.
The results are reported in \cref{fig:result_largesparse_cifar_resnet_dense16}.
We observe that the generalization improvement tends to increase as the model becomes more sparsely overparameterized;
more precisely, the average accuracy improvement increases from $0.4\%$ in the small dense model to around $0.8\%$ in the large sparse model.
This result suggests that sparsification could be used more actively when employing SAM.
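As a rough illustration of this parameter matching, using notation introduced here only for exposition, let $N_{\mathrm{dense}}$ denote the number of parameters of the small dense model and $N_{\mathrm{large}}$ that of the larger model to be sparsified.
Matching the parameter counts then amounts to choosing the sparsity level $s$ such that
\[
  (1 - s)\, N_{\mathrm{large}} = N_{\mathrm{dense}}
  \quad\iff\quad
  s = 1 - \frac{N_{\mathrm{dense}}}{N_{\mathrm{large}}};
\]
\eg, a hypothetical model with $4\times$ as many parameters as its dense counterpart would be sparsified to $s = 0.75$.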

\vspace{-0.5em}
\paragraph{Sufficient regularization is needed to secure the benefit of overparameterization.}
\label{sec:experiments-main-regularization}


We also investigate whether the overparameterization benefit for SAM persists when models are prone to overfitting due to insufficient regularization \citep{ying2019overview}.
Specifically, we evaluate three cases:
(a) without weight decay,
(b) without early stopping, and
(c) without sufficient inductive bias.\footnote{We train ViTs without pre-training on a massive dataset; ViTs are known to lack the inductive biases inherent to CNNs and are thus more prone to overfitting \citep{lee2021vision,chen2022transmix}.}
The results are reported in \cref{fig:result_overparam_diff_resnet_wo_wd,fig:result_overparam_diff_pos_wo_es,fig:result_overparam_diff_cifar_vit}.
We observe that, in these cases, the generalization improvement does not always increase simply by adding more parameters.
These results indicate that some level of regularization is required in practice to attain the overparameterization benefit for SAM.

