\section{Background} 
\label{sect:2_Background}

Let us consider the general unconstrained optimization problem:
\vspace{0.5em}
\begin{equation} \label{eq:optimization}
    \min_x f(x)
\end{equation}
\vspace{0.5em}
where $f: \R^d \rightarrow \R$ is the objective function to minimize, and $x \in \R^d$ is the optimization variable.
Based on recent studies that indicate a strong correlation between the sharpness of $f$ at a minimum and its generalization error \citep{keskar2017on,dziugaite2017computing,jiang2019fantastic}, \citet{foret2020sharpness} suggest turning (\ref{eq:optimization}) into a min-max problem of the following form:
\begin{equation} \label{eq:sam-objective}
    \min_x \max_{\lVert\epsilon\rVert_2\leq\rho} f(x + \epsilon)
\end{equation}
where $\epsilon$ and $\rho$ denote a perturbation added to $x$ and the bound on its norm, respectively.
Thus, the goal is now to seek $x$ that minimizes the worst-case value of $f$ over its $\rho$-neighborhood, such that the objective landscape becomes locally flat.
Taking the first-order Taylor approximation of $f$ at $x$ and solving for the optimal $\epsilon^\star$ gives the following update rule for SAM:
\begin{equation} \label{eq:sam-step}
    x_{t+1} = x_{t} - \eta \nabla f \left(x_t+\rho\frac{\nabla f(x_t)}{\lVert\nabla f(x_t)\rVert_2}\right).
\end{equation}
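For concreteness, the derivation can be sketched as follows, assuming $\nabla f(x) \neq 0$ and taking $\eta > 0$ to denote the step size.
The first-order expansion $f(x+\epsilon) \approx f(x) + \epsilon^\top \nabla f(x)$ reduces the inner maximization in (\ref{eq:sam-objective}) to a linear problem over the $\ell_2$ ball, which the Cauchy--Schwarz inequality solves in closed form:
\begin{equation*}
    \epsilon^\star = \operatorname*{arg\,max}_{\lVert\epsilon\rVert_2\leq\rho} \; \epsilon^\top \nabla f(x) = \rho\frac{\nabla f(x)}{\lVert\nabla f(x)\rVert_2}.
\end{equation*}
Evaluating the gradient of $f$ at the perturbed point $x + \epsilon^\star$, while ignoring the dependence of $\epsilon^\star$ on $x$ as in \citet{foret2020sharpness}, then yields the descent direction used in (\ref{eq:sam-step}).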
SAM has been shown to improve generalization performance over SGD \citep{chenvision, kaddour2022flat, bahri2022sharpness}, and subsequent works have analyzed various aspects of SAM under different settings, including its convergence rates \citep{andriushchenko2022towards, mi2022make, si2023practical} and implicit bias \citep{compagnoni2023sde, wensharpness,andriushchenko2024sharpness}.

Meanwhile, a considerable amount of evidence has indicated the benefit of overparameterization for training neural networks.
Besides the empirical success witnessed across different domains \citep{kaplan2020scaling,radford2021learning,dehghani2023scaling}, overparameterization has been shown in theory to turn all local minima into global ones, enabling local methods to succeed under non-convex settings \citep{kawaguchi2016deep,du2019gradient}.
Researchers have also proved that overparameterization enables much faster convergence \citep{ma2018power,vaswani2019fast,meng2020fast} and better generalization \citep{allen2019learning,brutzkus2019larger}.
To our knowledge, however, previous work has mostly focused on non-sharpness-aware optimizers, and the effects of overparameterization on SAM have been left rather unattended despite its contemporary significance to large-scale training trends and widespread usage in practice.
 