
\section{Other effects of overparameterization: Theoretical aspects} \label{sec:theory}

Thus far, we have focused on empirically exploring how an increasing number of parameters influences SAM, and discovered critical improvements in its generalization benefits.
Existing theoretical analyses of overparameterization also hint at other positive influences on different aspects of SAM, such as convergence \citep{ma2018power,vaswani2019fast} and implicit bias \citep{neyshabur2017implicit,zhang2017understanding}.
However, we find little work that explicitly verifies whether these influences extend to SAM.

To fill this gap, we develop theoretical analyses of the effect of overparameterization on SAM\footnote{We use an unnormalized version of SAM: $x_{t+1} = x_{t} - \eta \nabla f \left(x_t+\rho\nabla f(x_t)\right)$, an empirically similar variant of SAM often adopted to simplify proofs \citep{andriushchenko2022towards, compagnoni2023sde}.} in this section.
Specifically, we show that (i) linearly stable minima for SAM have more uniform Hessian moments compared to SGD (\cref{sec:sam-stability}), and (ii) SAM can converge much faster (\cref{sec:convergence}), all when the model is overparameterized.

To characterize overparameterization, we adopt a widely accepted definition: a model is overparameterized if it possesses more parameters than necessary to fit the entire training data or achieve zero training loss \citep{ma2018power, belkin2018understand, belkin2019reconciling, neyshabur2019role, nakkiran2020deep, nakkiran2020optimal}---that is, any model capable of interpolation.
We formalize this via the following \emph{interpolation} assumption:
\begin{definition} \label{def:interpolation}
    (Interpolation) Let $f(x)=\sum^n_{i=1} f_i(x)$. There exists $x^\star$ s.t. $f_i(x^\star)=0$ and $\nabla f_i(x^\star)=0$ for $i=1,\hdots,n$.
\end{definition}
Crucially, this implies that there exists a fixed point $x^\star$ for stochastic gradient-based optimizers, a property we rely on in the following two sections.
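
To see why for SAM, consider (as an illustrative assumption) a stochastic step of the unnormalized update that uses the same sampled index $i_t$ for both the ascent and the descent direction.
Starting from $x_t = x^\star$, interpolation gives
\begin{equation*}
x_{t+1} = x^\star - \eta \nabla f_{i_t}\bigl(x^\star + \rho \nabla f_{i_t}(x^\star)\bigr) = x^\star - \eta \nabla f_{i_t}(x^\star) = x^\star,
\end{equation*}
regardless of which $i_t$ is drawn, so the iterate remains at $x^\star$ with probability one.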


We note that the aim of these analyses is to complement, rather than directly support, \cref{sec:experiments-main,sec:understanding} by outlining theoretically guaranteed benefits of overparameterization for SAM.
We discuss the limitations further in \cref{sec:discussion}.


\subsection{SAM escapes sharp minima with non-uniform Hessian}
\label{sec:sam-stability}


Here we demonstrate that SAM keeps escaping from minima until it reaches one that satisfies conditions on flatness and on the uniformity of the Hessian moments that are stricter than those required by SGD.
To this end, we employ linear stability analysis \citep{wu2018sgd, wu2022alignment}, which aims to derive specific conditions a minimum should satisfy in order for a given optimizer to remain stable and not escape from it.

We first define linear stability as follows:
\begin{definition} \label{def:linear-stability}
    (Linear stability)
    Consider a general iterative first-order optimizer $x_{t+1} = x_t - G(x_t)$.
    A minimizer $x^\star$ is called linearly stable if there exists a constant $C$ such that
    $$\mathbb{E}[\| \tilde{x}_t-x^\star\|^2] \leq C \| \tilde{x}_0-x^\star\|^2$$
    for all $t > 0$ under the linearized dynamic near $x^\star$: $\tilde{x}_{t+1} = \tilde{x}_t -  \nabla G(x^\star) (\tilde{x}_t - x^\star)$, \ie, if the iterate does not deviate far from $x^\star$ once it arrives near this fixed point.
\end{definition}
Here, the linearized dynamic $\tilde{x}_t$ appears when the iterate $x_t$ approaches sufficiently near $x^\star$ such that the loss becomes approximately quadratic, with the existence of the fixed point $x^\star$ implied by the interpolation assumption in \cref{def:interpolation}.


With this, we provide the stability condition that minima should satisfy for a stochastic SAM to converge in the following theorem:
\begin{theorem} \label{thm:sam-stability} 
    Let us assume $x^\star=0$ without loss of generality.
    Then $x^\star$ is linearly stable for a stochastic SAM if the following is satisfied:
    \begin{equation} \label{eq:sam-stability-condition}
    \begin{split}
     \lambda_{\textup{max}} & \left((I - \eta H - \eta \rho H^2)^2 + \eta(\eta-2\rho) (M_2-H^2) \right.\\
     & \hspace{1em} \left.  + 2\eta^2\rho (M_3-H^3) + \eta^2\rho^2 (M_4 -H^4) \right) \leq 1
    \end{split}
    \end{equation}
    where $H = \frac{1}{n} \sum_{i=1}^n H_i$ and $M_k = \frac{1}{n}\sum_{i=1}^n H_i^k$ are the average Hessian and the $k$-th moment of the Hessian at $x^\star$ over $n$ training data.
    Consequently, as a necessary condition of (\ref{eq:sam-stability-condition}), it follows that
    \begin{align} \label{eq:stability-necessary}
        \begin{split}
            & 0 \leq a (1 + \rho a) \leq \frac{2}{\eta}, \quad 0 \leq s_2^2 \leq \frac{1}{\eta (\eta - 2 \rho)}, \quad \\
            & 0 \leq s_3^3 \leq \frac{1}{2 \eta^2 \rho}, \quad 0 \leq s_4^4 \leq \frac{1}{\eta^2 \rho^2},
        \end{split}
    \end{align}
    where $a = \lambda_{\textup{max}}(H)$ and $s_k = \lambda_{\textup{max}}((M_k - H^k)^{1/k})$ are the sharpness and the non-uniformity of the Hessian measured with the $k$-th moment, respectively.
\end{theorem}

The detailed proof of the theorem is provided in \cref{app:sam-stability}.
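
As a rough sketch of where the matrix in (\ref{eq:sam-stability-condition}) comes from, assume for illustration that each per-sample loss is exactly quadratic around $x^\star = 0$, \ie, $f_i(x) = \tfrac{1}{2} x^\top H_i x$ with symmetric $H_i$, and that a single index $i$ is drawn uniformly at random and used for both the ascent and the descent step.
Then $\nabla f_i(x) = H_i x$, and one step reads
\begin{equation*}
\tilde{x}_{t+1} = \tilde{x}_t - \eta H_i (\tilde{x}_t + \rho H_i \tilde{x}_t) = (I - \eta H_i - \eta\rho H_i^2)\, \tilde{x}_t .
\end{equation*}
Since $i$ is drawn independently of $\tilde{x}_t$, taking the expectation yields
\begin{align*}
\mathbb{E}\|\tilde{x}_{t+1}\|^2 &= \mathbb{E}\bigl[\tilde{x}_t^\top A\, \tilde{x}_t\bigr] \leq \lambda_{\textup{max}}(A)\, \mathbb{E}\|\tilde{x}_t\|^2, \\
A &= \frac{1}{n}\sum_{i=1}^n (I - \eta H_i - \eta\rho H_i^2)^2
   = I - 2\eta H + \eta(\eta - 2\rho) M_2 + 2\eta^2\rho M_3 + \eta^2\rho^2 M_4 .
\end{align*}
Expanding $(I - \eta H - \eta\rho H^2)^2 = I - 2\eta H + \eta(\eta - 2\rho) H^2 + 2\eta^2\rho H^3 + \eta^2\rho^2 H^4$ and subtracting it from $A$ leaves exactly the terms $\eta(\eta-2\rho)(M_2 - H^2) + 2\eta^2\rho(M_3 - H^3) + \eta^2\rho^2(M_4 - H^4)$, so $A$ coincides with the matrix in (\ref{eq:sam-stability-condition}).
Under these simplifying assumptions, $\lambda_{\textup{max}}(A) \leq 1$ gives $\mathbb{E}\|\tilde{x}_t\|^2 \leq \|\tilde{x}_0\|^2$, \ie, linear stability with $C = 1$.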

Our result (\ref{eq:stability-necessary}) suggests that, for $\rho > 0$, SAM requires less sharp minima and more uniformly distributed Hessian moments to achieve linear stability than SGD \citep{wu2018sgd}, whose condition corresponds to $\rho\rightarrow 0$ in (\ref{eq:stability-necessary}).
While a similar result is shared by a concurrent work of \citet{behdin2023msam}, we further ensure that the higher-order Hessian moments are bounded; interestingly, these bounds become tighter for a larger $\rho$.
To corroborate our result, we measure the empirical sharpness and non-uniformity of the Hessian.
The results are reported in \cref{fig:landscape,fig:stability_uniformity}.
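
As a quick consistency check on the limit $\rho \rightarrow 0$, setting $\rho = 0$ in (\ref{eq:sam-stability-condition}) removes the $M_3$ and $M_4$ terms and leaves
\begin{equation*}
\lambda_{\textup{max}}\left((I - \eta H)^2 + \eta^2 (M_2 - H^2)\right) \leq 1,
\end{equation*}
and the necessary conditions in (\ref{eq:stability-necessary}) reduce to $0 \leq a \leq 2/\eta$ and $0 \leq s_2 \leq 1/\eta$, with no constraint on $s_3$ or $s_4$.
For $\rho > 0$, in contrast, the sharpness condition $a(1 + \rho a) \leq 2/\eta$ is strictly tighter than $a \leq 2/\eta$ for any $a > 0$, and the bounds $s_3^3 \leq 1/(2\eta^2\rho)$ and $s_4^4 \leq 1/(\eta^2\rho^2)$ are finite and shrink as $\rho$ grows.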


\begin{figure*}[!t]
  \begin{subfigure}{0.32\linewidth}
      \centering
      \includegraphics[width=0.48\linewidth,trim={1.3cm 0cm 0.8cm 0cm},clip]{figures/mnist/landscape/eigenvectorMLP_[3000, 1000]_sgd_0.0_0.0_random_0.0_seed_1_dim_3d_dir_seed_0.pdf}
      \includegraphics[width=0.48\linewidth,trim={1.3cm 0cm 0.8cm 0cm},clip]{figures/mnist/landscape/eigenvectorMLP_[3000, 1000]_usam_0.0_0.2_random_0.0_seed_1_dim_3d_dir_seed_0.pdf}
      \caption{Landscape (SGD vs. SAM)}
      \label{fig:landscape}
  \end{subfigure}
  \hspace*{\fill}
  \begin{subfigure}{0.16\linewidth}
      \centering
      \includegraphics[width=\linewidth,trim={0.1cm 0.5cm 0.1cm 0},clip]{figures/mnist/eigen/nonuniformity_neuron[3000, 1000].pdf}
      \vspace{-1.2em}
      \caption{Non-uniformity}
      \label{fig:stability_uniformity}
  \end{subfigure}
  \begin{subfigure}{0.5\linewidth}
      \centering
      \includegraphics[width=0.32\linewidth, trim={1em 1.2em 1em -3em}, clip]{figures/synth/matrix_factor/convergence/usam-rank_4_10.pdf}
      \includegraphics[width=0.32\linewidth, trim={1em 1.2em 1em 0}, clip]{figures/mnist/convergence/sam-num_neurons_30_10_300_100_3000_1000.pdf}
      \includegraphics[width=0.32\linewidth, trim={1em 1.2em 1em 0}, clip]{figures/cifar/ResNet18/convergence/sam-num_filters_4_16_64.pdf}
      \vspace{-0.3em}
      \caption{Convergence}
      \label{fig:convergence}
  \end{subfigure}
  \caption{
    (a) Loss landscapes of SGD (left) and SAM (right) along with the corresponding sharpness $a = \lambda_{\textup{max}}(H)$. SAM converges to flatter minima with lower sharpness compared to SGD.
    (b) Non-uniformity of the Hessian for SGD and SAM. SAM has a more uniform Hessian distribution than SGD.
    (c) Convergence properties of SAM. As the model becomes overparameterized, SAM converges much faster and closer to a linear rate.
    See \cref{app:exp_details_theory} for the experiment details.}
  \label{fig:stability_experiments}
  \vspace{-1.0em}
\end{figure*}

\subsection{Stochastic SAM converges much faster with overparameterization}
\label{sec:convergence}


Prior works have revealed the power of overparameterization for stochastic optimization methods to accelerate convergence \citep{ma2018power,vaswani2019fast,meng2020fast}.
We prove that this benefit also extends to a stochastic SAM.


In addition to the interpolation assumption in \cref{def:interpolation}, we introduce the following assumptions.
\begin{definition}
    (Smoothness) $f$ is $\beta$-smooth if there exists  $\beta>0$ s.t. $\| \nabla f (x) -\nabla f (y) \| \leq \beta \| x-y \|$ for all $x,y  \in \mathbb R ^d$.
\end{definition}
\begin{definition}
    (Polyak-\L{}ojasiewicz) $f$ is $\alpha$-PL if there exists $\alpha > 0$ s.t. $\| \nabla f (x) \|^2 \geq \alpha (f(x)-f(x^\star))$ for all $x  \in \mathbb R ^d$.
\end{definition}


The smoothness and the Polyak-\L{}ojasiewicz (PL) assumptions are standard and used frequently in optimization \citep{gower2020variance, meng2020fast, nutini2022let, karimi2016linear}.
The smoothness assumption is satisfied for any neural network with smooth activation and loss function with bounded inputs \citep{andriushchenko2022towards}, and the PL condition is argued to be satisfied when the model is overparameterized \citep{belkin2021fit, liu2022loss}, which we empirically verify in \cref{fig:resut_empirical_pl} of \cref{app:emp-measure}.
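
For intuition on why these two assumptions lead to linear rates, the following is the standard argument for deterministic gradient descent, written in the convention of the definitions above (a sketch only, separate from our analysis of SAM).
For a $\beta$-smooth and $\alpha$-PL function $f$ with $f(x^\star) = 0$, as is the case under interpolation, the step $x_{t+1} = x_t - \eta \nabla f(x_t)$ with $\eta = 1/\beta$ satisfies
\begin{align*}
f(x_{t+1}) &\leq f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{\beta \eta^2}{2} \|\nabla f(x_t)\|^2
            = f(x_t) - \frac{1}{2\beta} \|\nabla f(x_t)\|^2 \\
           &\leq \left(1 - \frac{\alpha}{2\beta}\right) f(x_t),
\end{align*}
where the first inequality uses smoothness and the last uses the PL condition.
Unrolling over $t$ iterations gives $f(x_t) \leq (1 - \alpha/(2\beta))^t f(x_0)$.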

Under these assumptions, we present the following convergence theorem of a stochastic SAM:

\begin{theorem}\label{thm:main-PL-stochSAM}
Suppose each $f_i$ is $\beta$-smooth, $f$ is $\lambda$-smooth and $\alpha$-PL, and interpolation holds.
For any $\rho\leq \frac{1}{(\beta/\alpha + 1/2)\beta}$, a stochastic SAM that runs for $t$ iterations with constant step size $\eta^\star \defeq \frac{\alpha-(\beta + \alpha/2)\beta\rho}{2\lambda\beta(\beta\rho+1)^2}$ gives the following convergence guarantee:
\begin{equation*}\label{eq:theorem-PL-stochastic-SAM}
\ex{x_t}{f(x_t)}\leq
\left(1 - \frac{\alpha-(\beta + \alpha/2)\beta\rho}{2} \, \eta^\star\right)^t\,f(x_0).
\end{equation*}
\end{theorem}

We provide the full proof in \cref{app:prooflinconv}, which also contains results for the more general case of a mini-batch SAM.
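
To make the rate in \cref{thm:main-PL-stochSAM} more concrete, write $c(\rho) \defeq \alpha - (\beta + \alpha/2)\beta\rho$ (a shorthand introduced only for this remark), which is nonnegative exactly when $\rho \leq \frac{1}{(\beta/\alpha + 1/2)\beta}$.
Substituting $\eta^\star = \frac{c(\rho)}{2\lambda\beta(\beta\rho+1)^2}$ into the bound, the per-iteration contraction factor becomes
\begin{equation*}
1 - \frac{c(\rho)}{2}\,\eta^\star = 1 - \frac{c(\rho)^2}{4\lambda\beta(\beta\rho+1)^2}.
\end{equation*}
At $\rho = 0$ this gives $\eta^\star = \frac{\alpha}{2\lambda\beta}$ and a factor of $1 - \frac{\alpha^2}{4\lambda\beta}$.
As $\rho$ approaches its upper limit, $c(\rho) \rightarrow 0$, so $\eta^\star \rightarrow 0$ and the factor tends to $1$.
This describes the guarantee itself and need not reflect how the choice of $\rho$ affects convergence in practice.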

This result shows that, with overparameterization, a stochastic SAM can converge at a linear rate, as fast as the deterministic gradient method, which is much faster than the well-known sublinear rate of $\mathcal{O}(1/t)$ for SAM \citep{andriushchenko2022towards}.
Our analysis also suggests that, under overparameterization, convergence is guaranteed without the bounded variance assumption or a diminishing step size, whereas without overparameterization convergence does not hold in that setting \citep{andriushchenko2022towards}.
This suggests that overparameterization can significantly ease the convergence of SAM.
We corroborate our result empirically as well, by measuring how training proceeds with overparameterization in realistic settings.
The results are plotted in \cref{fig:convergence}.

