\section{Test error of SAM can decrease with overparameterization}
\label{sec:theory-gen-main}

Recent works have shown, both empirically and theoretically, that overparameterization can even improve generalization \citep{neyshabur2017exploring,brutzkus2019larger}.
Here, we show that overparameterization also improves generalization for SAM, in the sense that the test error can decrease as the network width (and thus the number of parameters) grows.

We follow the setting of \citet{allen2019learning}.
Specifically, we consider risk minimization over an unknown data distribution $\mathcal{D}$ using a one-hidden-layer ReLU network with a smooth convex loss function (\eg, cross entropy).
The network is assumed to be initialized from a Gaussian distribution and to take bounded inputs.
Then, we characterize the generalization of stochastic SAM as follows.

\begin{theorem}\label{thm:sam-genbound-main}
(Informal)
Suppose we train a network having $m$ hidden neurons with training data sampled from $\D$.
Then, for every $\varepsilon$ in some open interval, there exists $M \propto 1/\varepsilon$ such that for every $m \geq M$, with appropriate values of $\eta, \rho, T$, stochastic SAM gives the following guarantee on the test loss with high probability:
\begin{equation*}\label{eq:sam-genbound}
\ex{x_0, \cdots, x_{T-1}}{ \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}_{\mathcal{D}} [f(x_t)] } \leq \varepsilon.
\end{equation*}
\end{theorem}

We present a formal version of the theorem and its proof in \cref{app:sam-genbound}.

This result suggests that achieving a test loss of at most $\varepsilon$ by running $T$ iterations of SAM requires a minimum width $M$ proportional to $1/\varepsilon$.
In other words, a network with a larger width can achieve a lower test error, and hence overparameterization can improve generalization for SAM.

\paragraph{Experiment}

\begin{wrapfigure}{r}{0.3\linewidth}
  \vspace{-1em}
  \resizebox{\linewidth}{!}{
  \centering
  \includegraphics[width=\linewidth, trim={0.8em 0.8em 0.8em 0}, clip]{figures/synth/generalization/sam.pdf}
  }
  \caption{
    Generalization of SAM. Test error keeps decreasing as the number of neurons grows.
  }
  \label{fig:stability_experiments}
  \vspace{-1.0em}
\end{wrapfigure}

We support this result empirically on synthetic data for a simple regression task.
Specifically, following the setup of \citet{allen2019learning}, we train $2$-layer ReLU networks with synthetic data.
Here, each element of the input $x = (x_1, x_2, x_3, x_4) \in \R^4$ is sampled from a Gaussian distribution, the input is then normalized to satisfy $\| x \|_2 = 1$, and the target $y$ is computed as $y = (\sin(3x_1) + \sin(3x_2) + \sin(3x_3)-2)^2 \cdot \cos (7x_4)$.
The weights and biases of the first layer are initialized from $\mathcal{N}(0, 1/m)$, where $m$ is the number of hidden neurons, and the weights of the second layer are initialized from $\mathcal{N}(0, 1)$.
We train only the weights of the first layer for $800$ epochs, while the biases of the first layer and the weights of the second layer are frozen at their initial values.
We use $1000$ and $5000$ data points for training and testing, respectively.
We use a batch size of $50$ without weight decay and decay the learning rate by a factor of $0.1$ after $50\%$ of the total epochs.
We perform a grid search over the learning rate and $\rho$ from $\{10^{-k} \mid 2 \leq k \leq 7\}$ and $\{10^{-k} \mid 1 \leq k \leq 5\}$, respectively.



\section{Proof of Theorem \ref{thm:sam-genbound-main}}

\label{app:sam-genbound}

In this section, we provide the formal version of \cref{thm:sam-genbound-main} and its proof.

\subsection{Notation and Setup}

Throughout this section, we use the same notation and setup as \citet{allen2019learning}.
We remark that the notations are different from those used in \cref{sec:theory-gen-main,app:prooflinconv,app:sam-stability}.

First, we assume an unknown data distribution $\D$ in which each data point $z = (x, y)$ consists of an input $x \in \R^d$ and the corresponding label $y \in \Y$.
We also assume, without loss of generality, that $\| x\|_2 = 1$ and $x_d = 1/2$.
The loss function $L: \R^k \times \Y \rightarrow \R$ is assumed to be non-negative, convex, $1$-Lipschitz continuous, and $1$-smooth with respect to its first argument.

Next, we define the target network $F^* = (f^*_1, \cdots, f^*_k): \R^d \rightarrow \R^k$ as 
\begin{equation} \label{eq:targetfn}
    f^*_r(x) \eqdef \sum_{i=1}^p a^*_{r,i} \phi_i(\ip{w^*_{1,i}}{x}) \ip{w^*_{2,i}}{x}
\end{equation}
where each $\phi_i: \R \rightarrow \R$ is an infinite-order smooth function.
Here, we assume that $\|w^*_{1,i} \|_2 = \|w^*_{2,i}\|_2 = 1, | a^*_{r,i} | \leq 1$ hold for all $i \in \{1, \cdots, p\}$.
We denote the sample and network complexity of $\phi$ as $\compsam$  and $\compnet$ respectively (see Section 2 of \citet{allen2019learning} for the formal definitions).
Suppose we have a concept class $\C$ that consists of all functions $F^*$ with a bounded number of parameters $p$ and bounded complexity $\mathfrak{C}$.
We also denote the population risk achieved by the best target function $F^*$ in this concept class as $\opt$, \ie, $\opt = \underset{F^* \in \C}{\min} \E_{(x, y) \sim \D}[L(F^*(x), y)]$.

Then, we define the learner network $F = (f_1, \cdots, f_k): \R^d \rightarrow \R^k$ as below.
\begin{equation} \label{eq:relu-net}
    f_r(x) \eqdef \sum_{i=1}^m \init{a}_{r,i} \operatorname{ReLU} (\ip{w_i}{x} + \init{b}_i).
\end{equation}
Note that the learner network is a $2$-layer ReLU network with $m$ neurons.
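
For reference, the per-neuron gradient of \cref{eq:relu-net} can be written out explicitly.
The following is a sketch rather than part of the original argument; it takes the derivative of the ReLU at zero to be $1$, an assumption made here to match the indicator convention $\Id[\cdot \geq 0]$ used later in this section.
\begin{equation*}
    \frac{\partial}{\partial w_i} f_r(x) = \init{a}_{r,i} \, \Id [\ip{w_i}{x} + \init{b}_i \geq 0] \, x, \qquad \left \| \frac{\partial}{\partial w_i} f_r(x) \right \|_2 \leq | \init{a}_{r,i} | \cdot \| x \|_2 = | \init{a}_{r,i} |.
\end{equation*}
Since the inputs satisfy $\| x \|_2 = 1$, the size of each per-neuron gradient is controlled by the magnitude of the frozen second-layer weight alone.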
We train the network with $N$ data points sampled from $\D$ and denote the training set as $\Z = \{z_1, \cdots, z_N\}$.
We only train the weights $W = (w_1, \cdots, w_m) \in \R^{m \times d}$ and freeze the values of $a, b$ during the training.
We denote the initial value of the weight and its value at time $t$ as $\initw$ and $ \initw + W_t$ respectively.
Each element of $\initw$ and $\init{b}$ is initialized from $\mathcal{N}(0, 1/m)$, while each element of $\init{a}_r$ is initialized from $\mathcal{N}(0, \varepsilon_a^2)$ for some fixed $\varepsilon_a \in (0, 1]$.
At each step $t$, we sample a single data point $z = (x, y)$ from $\Z$ and update $W$ using the un-normalized version of SAM:
\begin{align} \label{eq:samdef-gen}
    W_{t+1} &= W_t - \eta \nabla L(F(x; \initw + W_{t + 1/2}), y) \notag \\
    &= W_t - \eta \nabla L(F(x; \initw + W_t + \rho \nabla L(F(x; \initw + W_t), y)), y).
\end{align}
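
It can help to read \cref{eq:samdef-gen} as two separate steps.
The sketch below assumes that the intermediate iterate $W_{t+1/2}$ is the ascent point $W_t + \rho \nabla L$, consistent with how $\iterthalf{w}_i - \itert{w}_i$ is expanded in the proof of \cref{lemma:sam-coupling-samupdate}.
\begin{align*}
    W_{t+1/2} &= W_t + \rho \nabla L(F(x; \initw + W_t), y) && \text{(ascent step)} \\
    W_{t+1} &= W_t - \eta \nabla L(F(x; \initw + W_{t+1/2}), y) && \text{(descent step)}
\end{align*}
The standard SAM update instead perturbs by $\rho \nabla L / \| \nabla L \|_2$; ``un-normalized'' refers to dropping this normalization, so the perturbation size scales with the gradient norm.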

\subsection{Formal Theorem}

Now, we are ready to present the formal version of \cref{thm:sam-genbound-main} below.

\begin{theorem} \label{thm:sam-genbound-formal} (SAM version of Theorem 1 in \citet{allen2019learning})
    For every $\varepsilon \in \left(0, \frac{1}{p k \compsam (\phi, 1)}\right)$, there exists $M_0 = \poly(\compnet(\phi, 1), 1/\varepsilon)$ and  $N_0 = \poly(\compsam(\phi, 1), 1/\varepsilon)$ such that for every $m \geq M_0$ and every $N \geq \widetilde{\Omega}(N_0)$, by choosing $\varepsilon_a = \varepsilon / \widetilde{\Theta}(1)$ for the initialization and $\eta = \wttheta(\frac{1}{\varepsilon k m}), \rho = \wttheta(\frac{1}{\varepsilon^3 k m^3}), T = \wttheta \left( \frac{(\compsam(\phi, 1))^2 \cdot k^3p^2}{\varepsilon^2} \right)$, running $T$ iterations of stochastic SAM defined in \cref{eq:samdef-gen} gives the following generalization bound with high probability over the random initialization.
    \begin{equation} \label{eq:sam-gen-formal}
        \E_{\text{SAM}} \left [ \frac{1}{T} \sum_{t=0}^{T-1} \E _{(x, y) \sim \D} L(F(x; \initw + W_t), y) \right] \leq \opt + \varepsilon.
    \end{equation}
\end{theorem}

Here, the $\wt{O}(\cdot)$ notation hides factors of $\textsf{polylog}(m)$.

\subsection{Proof of Theorem \ref{thm:sam-genbound-formal}}

We now present the proof of \cref{thm:sam-genbound-formal}.

First, note that we can directly reuse the algorithm-independent part of the analysis of \citet{allen2019learning}.
Thus, it suffices to show that an analogue of Lemma B.4 in \citet{allen2019learning} also holds for SAM.

We first define the function $G = (g_1, \cdots, g_k): \R^d \rightarrow \R^k$ similarly to \citet{allen2019learning}.
\begin{equation} \label{eq:def-g}
    g_r(x; W_t) \eqdef \sum_{i=1}^m \init{a}_{r,i} (\ip{\itert{w}_i}{x} + \init{b}_i) \Id [\ip{\init{w}_i}{x} + \init{b}_i \geq 0].
\end{equation}

Then, the following corollary for stochastic SAM follows from Lemma B.3 of \citet{allen2019learning}.
It gives an upper bound on the norm of the difference between $\frac{\partial}{\partial W} L(F(\cdot), y)$ and $\frac{\partial}{\partial W} L(G(\cdot), y)$.
\begin{corollary} \label{corollary:sam-coupling-main}
    (SAM version of Lemma B.3 in \citet{allen2019learning})
    Let $\tau = \varepsilon_a (\eta + \rho) t$.
    Then, for every $x$ satisfying $\| x \|_2 = 1$, and for every time step $t \geq 1$, the following are satisfied with high probability over the random initialization. \\
    (a) For every $r \in [k]$, 
    \begin{equation*}
        \left \lvert f_r(x; \init{W} + W_t) - g_r (x; \init{W} + W_t) \right \rvert = \wt{O}(\varepsilon_a k \tau^2 m^{3/2})
    \end{equation*} 
    (b) For every $y \in \Y$, 
    \begin{equation}
        \left \| \frac{\partial}{\partial W} L(F(x; \initw + W_t), y) - \frac{\partial}{\partial W} L(G(x; \initw + W_t), y) \right \|_{2,1} \leq \wt{O}(\varepsilon_a k \tau m^{3/2} + \varepsilon_a^2 k^2 \tau^2 m^{5/2})
    \end{equation}
\end{corollary} 

Next, we present the key lemma of our proof.
Part (c) is used directly in the proof and gives an upper bound on the norm of the difference between the SAM gradient and the SGD gradient for $F$.

\begin{lemma} \label{lemma:sam-coupling-samupdate}
    For every $x$ satisfying $\| x \|_2 = 1$, and for every time step $t \geq 1$, the following are satisfied with high probability over the random initialization. \\
    (a) For at most $\wt{O}(\varepsilon_a \rho \sqrt{km})$ fraction of $i \in [m]$: we have 
    \begin{equation*}
        \Id [\ip{\iterthalf{w}_i}{x} + \init{b}_i \geq 0] \neq \Id [ \ip{\itert{w}_i}{x} + \init{b}_i \geq 0].
    \end{equation*}
    (b) For every $r \in [k]$, 
    \begin{equation*}
        \left \lvert f_r(x; \init{W} + W_{t + 1/2}) - f_r (x; \init{W} + W_t) \right \rvert = \wt{O}(\varepsilon_a^3 k \rho^2 m^{3/2} + \varepsilon_a^2 \sqrt{k} \rho m)
    \end{equation*} 
    (c) For every $y \in \Y$, 
    \begin{align}
        & \left \| \frac{\partial}{\partial W} L(F(x; \initw + W_{t + 1/2}), y) - \frac{\partial}{\partial W} L(F(x; \initw + W_t), y) \right \|_{2,1} \notag \\ 
        & \leq \wt{O}(\varepsilon_a^2 k \rho m^{3/2} + \varepsilon_a^4 k^2 \rho^2 m^{5/2} + \varepsilon_a^3 k^{3/2} \rho m^2)
    \end{align}
\end{lemma}

\begin{proof}
    Recall that the following hold from the definition of $F$ (see Lemma B.3 of \citet{allen2019learning} for the details). \\
    \begin{equation} \label{eq:gen-gradient-bound}
        \left \| \frac{\partial}{\partial w_i} f_r (x; \initw + W_t) \right \|_2 \leq \varepsilon_a B \quad \text{and} \quad \left\| \frac{\partial}{\partial w_i} L(F(x; \initw + W_t), y) \right \|_2 \leq \sqrt{k} \varepsilon_a B
    \end{equation}
    (a) 
    Let $\tau = \varepsilon_a \rho$ and define $\calH \eqdef \left \{ i \in [m] \;\bigg\vert\; \left \lvert \ip{\itert{w}_i}{x} + \init{b}_i \right \rvert \geq 2 \sqrt{k} B \tau \right \}$.
    Then, the claim is a direct corollary of Lemma B.3 (a) of \citet{allen2019learning}. \\
    (b) 
    We divide $i$ into two cases.
    First, when $i \notin \calH$, we can directly utilize Lemma B.3.(b) of \citet{allen2019learning} and the total difference from these $i$'s is $\wt{O}(\varepsilon_a^3 k \rho^2 m^{3/2})$.
    Next, we consider the differences from $i \in \calH$.
    \begin{align*}
        &\quad \left \lvert \init{a}_{r,i} \left( \left\langle \iterthalf{w}_i, x \right \rangle + \init{b}_i \right) \Id \left[\left\langle \iterthalf{w}_i, x \right \rangle + \init{b}_i \geq 0\right] \right. \\ 
        & \hspace{2em} \left. - \init{a}_{r,i} \left( \left\langle \itert{w}_i, x \right \rangle + \init{b}_i \right) \Id \left[\left\langle \itert{w}_i, x \right \rangle + \init{b}_i \geq 0\right] \right \rvert \\
        &\leq \left \lvert \init{a}_{r,i} \left( \left\langle \iterthalf{w}_i - \itert{w}_i, x \right \rangle \right) \right \rvert \\
        &= \left \lvert \init{a}_{r,i} \left (\left \langle \rho \cdot \frac{\partial}{\partial w_i} L(F(x; \initw + W_t), y), x \right \rangle \right ) \right \rvert \\
        & \leq \rho \left \lvert \init{a}_{r,i} \right \rvert \cdot \left \| \frac{\partial}{\partial w_i} L(F(x; \initw + W_t), y)  \right \|_2 \cdot \| x \|_2 \\
        &\leq \rho (\varepsilon_a B) \cdot (\sqrt{k} \varepsilon_a B) \\
        &= \wt{O}(\varepsilon_a^2 \sqrt{k} \rho)
    \end{align*}
    The first inequality is from the fact that $i \in \calH$ and thus $\Id \left[\left\langle \iterthalf{w}_i, x \right \rangle + \init{b}_i \geq 0\right] = \Id \left[\left\langle \itert{w}_i, x \right \rangle + \init{b}_i \geq 0\right]$.
    The remaining steps use the definition of SAM (\ref{eq:samdef-gen}) and the Cauchy--Schwarz inequality.
    Since there are at most $m$ indices $i \in \calH$, the total difference from $i \in \calH$ amounts to $\wt{O}(\varepsilon_a^2 \sqrt{k} \rho m)$.
    Combining the two cases proves (b). \\
    (c)
    By the chain rule, we have 
    \begin{equation*}
        \frac{\partial}{\partial w_i} L(F(x; \initw + W_t), y) = \nabla L(F(x; \initw + W_t), y) \frac{\partial}{\partial w_i} F(x; \initw + W_t).
    \end{equation*}
    Since $L$ is $1$-smooth, applying part (b) of this lemma to each of the $k$ outputs gives
    \begin{align} \label{eq:gen-lipschitz-smoothness}
        & \hspace{1.1em} \left \| \nabla L(F(x; \initw + W_{t + 1/2}), y) - \nabla L(F(x; \initw + W_t), y) \right \|_2 \notag \\
        &\leq \left \| F(x; \initw + W_{t + 1/2}) - F(x; \initw + W_t) \right \|_2 \notag \\
        &\leq \wt{O} \left(\varepsilon_a^3 k^{3/2} \rho^2 m^{3/2} + \varepsilon_a^2 k \rho m \right).
    \end{align}
    
    For $i \in \calH$, we have $\Id [\ip{\iterthalf{w}_i}{x} + \init{b}_i \geq 0] = \Id [ \ip{\itert{w}_i}{x} + \init{b}_i \geq 0]$ and thus $\frac{\partial}{\partial w_i} F(x; \initw + W_{t + 1/2}) = \frac{\partial}{\partial w_i} F(x; \initw + W_t)$.
    Then, combining (\ref{eq:gen-lipschitz-smoothness}) with (\ref{eq:gen-gradient-bound}) and using the fact that there are at most $m$ indices $i \in \calH$, the total contribution of these indices amounts to $\wt{O} \left(\varepsilon_a^4 k^2 \rho^2 m^{5/2} + \varepsilon_a^3 k^{3/2} \rho m^2 \right)$.
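
    The contribution of the indices $i \in \calH$ can be traced through the following decomposition.
    This is a sketch of the intermediate step, where $F_{t+1/2}$ and $F_t$ are shorthands introduced here for $F(x; \initw + W_{t+1/2})$ and $F(x; \initw + W_t)$, and the $\sqrt{k}$ factor assumes that the Jacobian $\frac{\partial}{\partial w_i} F$ is bounded by stacking the $k$ per-output bounds in (\ref{eq:gen-gradient-bound}).
    Because the Jacobians at $W_{t+1/2}$ and $W_t$ coincide for $i \in \calH$, only the change in $\nabla L$ remains:
    \begin{align*}
        \left \| \frac{\partial}{\partial w_i} L(F_{t+1/2}, y) - \frac{\partial}{\partial w_i} L(F_t, y) \right \|_2
        &= \left \| \left( \nabla L(F_{t+1/2}, y) - \nabla L(F_t, y) \right) \frac{\partial}{\partial w_i} F_t \right \|_2 \\
        &\leq \left \| \nabla L(F_{t+1/2}, y) - \nabla L(F_t, y) \right \|_2 \cdot \sqrt{k} \varepsilon_a B.
    \end{align*}
    Summing over at most $m$ such indices and applying (\ref{eq:gen-lipschitz-smoothness}) gives $m \cdot \sqrt{k} \varepsilon_a B \cdot \wt{O}(\varepsilon_a^3 k^{3/2} \rho^2 m^{3/2} + \varepsilon_a^2 k \rho m) = \wt{O}(\varepsilon_a^4 k^2 \rho^2 m^{5/2} + \varepsilon_a^3 k^{3/2} \rho m^2)$, with $B$ absorbed into $\wt{O}(\cdot)$ as in part (b).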

    Next, for $i \notin \calH$, we can directly use the result from Lemma B.3.(c) of \citet{allen2019learning}, and these indices contribute $\wt{O}(\varepsilon_a^2 k \rho m^{3/2})$.
    Summing the two contributions proves the bound.
\end{proof}


Finally, we show that the following lemma holds, which is a SAM version of Lemma B.4 in \citet{allen2019learning}.
Combined with the algorithm-independent parts presented in \citet{allen2019learning}, this lemma concludes the proof of \cref{thm:sam-genbound-formal}.
We write $L_F(\Z; W) \eqdef \frac{1}{|\Z|} \sum_{(x, y) \in \Z} L (F(x; W + \initw), y)$ and define $L_G(\Z; W)$ similarly.

\begin{lemma} \label{lemma:sam-gen-optim} (SAM version of Lemma B.4 in \citet{allen2019learning})
    For every $\varepsilon \in \left(0, \frac{1}{p k \compsam(\phi, 1)}\right)$, letting $\varepsilon_a = \varepsilon / \wttheta(1)$, $\eta = \wttheta(\frac{1}{\varepsilon k m})$, and $\rho = \wttheta(\frac{1}{\varepsilon^3 k m^3})$, there exists $M = \poly (\compnet(\phi, 1), 1/\varepsilon)$ and $T = \Theta\left(\frac{k^3 p^2 \cdot \compsam(\phi, 1)^2}{\varepsilon^2}\right)$
    such that if $m \geq M$, the following holds with high probability over random initialization.
    \begin{equation} \label{eq:gen-optimbound}
        \frac{1}{T} \sum_{t=0}^{T-1} L_F (\Z; W_t) \leq \opt + \varepsilon.
    \end{equation}
\end{lemma}
\begin{proof}
    Let $\refw$ be the weights constructed in Corollary B.2 of \citet{allen2019learning}.
    By the convexity of $L$ and the Cauchy--Schwarz inequality, we have
    \begin{align*}
        L_G(\Z; W_t) - L_G(\Z; \refw) &\leq \ip{\nabla L_G(\Z; W_t)}{W_t - \refw} \\
        &= \ip{\nabla L_G(\Z; W_t) - \nabla L_F (\Z; W_t)}{W_t - \refw} \\
        &\quad + \ip{\nabla L_F(\Z; W_t) - \nabla L_F(\Z; W_{t+1/2})}{W_t - \refw} \\
        &\quad + \ip{\nabla L_F(\Z; W_{t+1/2})}{W_t - \refw} \\
        &\leq \| \nabla L_G(\Z; W_t) - \nabla L_F (\Z; W_t) \|_{2,1} \| W_t - \refw\|_{2,\infty} \\
        &\quad + \| \nabla L_F(\Z; W_t) - \nabla L_F(\Z; W_{t+1/2}) \|_{2,1} \| W_t - \refw\|_{2,\infty} \\
        &\quad + \ip{\nabla L_F(\Z; W_{t+1/2})}{W_t - \refw}
    \end{align*}

    From the SAM update rule (\ref{eq:samdef-gen}), we have the following equality. 
    \begin{align*}
        \| W_{t+1} - \refw \|_F^2 &= \| W_t - \eta \nabla L_F (\itert{z}; W_{t+1/2}) - \refw \|_F^2 \\
        &= \| W_t - \refw\|_F^2 -2 \eta \ip{\nabla L_F(\itert{z}; W_{t + 1/2})}{W_t - \refw} + \eta^2 \| \nabla L_F (\itert{z}; W_{t + 1/2})\|_F^2.
    \end{align*}
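
    Rearranging this identity isolates the inner product that appears in the last term of the convexity bound above; the step is pure algebra and is written out here only for convenience.
    \begin{equation*}
        \ip{\nabla L_F(\itert{z}; W_{t + 1/2})}{W_t - \refw} = \frac{\| W_t - \refw\|_F^2 - \| W_{t+1} - \refw\|_F^2}{2\eta} + \frac{\eta}{2} \| \nabla L_F (\itert{z}; W_{t + 1/2})\|_F^2.
    \end{equation*}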

    Thus, we have 
    \begin{align*}
        L_G(\Z; W_t) - L_G(\Z; \refw) 
        & \leq \underbrace{\| \nabla L_G(\Z; W_t) - \nabla L_F (\Z; W_t) \|_{2,1} \|W_t - \refw\|_{2, \infty}}_{(A)} \\
        &\quad + \underbrace{\| \nabla L_F(\Z; W_t) - \nabla L_F(\Z; W_{t+1/2}) \|_{2,1} \| W_t - \refw\|_{2,\infty}}_{(B)} \\
        &\quad + \frac{\| W_t - \refw\|_F^2 - \E_{\itert{z}}[\| W_{t+1} - \refw\|_F^2]}{2\eta} \\
        &\quad + \underbrace{\frac{\eta}{2} \| \nabla L_F (\itert{z}; W_{t+1/2})\|_F^2}_{(C)}.
    \end{align*}

    Since $\| W_t - \refw\|_{2,\infty} = \wt{O}\left(\sqrt{k}\varepsilon_a (\eta + \rho) t + \frac{k p C_0}{\varepsilon_a m}\right)$ and $t \leq T$, $(A)$ is bounded as
    \begin{equation*}
        (A) = \wt{O}\left(\sqrt{k}\varepsilon_a (\eta + \rho) T \Delta + \frac{kpC_0}{\varepsilon_a m} \Delta \right)
    \end{equation*}
    where $\Delta = \wt{O}\left(\varepsilon_a^2 k (\eta + \rho) T m^{3/2} + \varepsilon_a^4 k^2 (\eta + \rho)^2 T^2 m^{5/2}\right)$.

    Next, we can bound $(B)$ from \cref{lemma:sam-coupling-samupdate}(c) as follows.
    \begin{equation*}
        (B) = \wt{O}(\sqrt{k} \varepsilon_a (\eta + \rho) T \Delta' + \frac{k p C_0}{\varepsilon_a m}\Delta'),
    \end{equation*}
    where $ \| \nabla L_F(\Z; W_t) - \nabla L_F(\Z; W_{t+1/2}) \|_{2,1} \leq \Delta' = \varepsilon_a^2 k \rho m^{3/2} + \varepsilon_a^4 k^2 \rho^2 m^{5/2} + \varepsilon_a^3 k^{3/2} \rho m^2$.

    We also have
    \begin{equation*}
        (C) = \wt{O}(\eta \varepsilon_a^2 k m)
    \end{equation*}
    since the norm of $\nabla L_F$ is always bounded as $\| \nabla L_F (\itert{z}; \cdot)\|_F^2 = \wt{O}(\varepsilon_a^2 k m)$.

    Then, by telescoping, we have 
    \begin{align*}
        \frac{1}{T} \sum_{t=0}^{T-1} \E_{\text{SAM}} [L_G(\Z; W_t)] - L_G(\Z; \refw) 
        &\leq \wt{O}\left(\sqrt{k}\varepsilon_a (\eta + \rho) T \Delta + \frac{kpC_0}{\varepsilon_a m} \Delta \right) \\
        &\quad + \wt{O}\left(\sqrt{k} \varepsilon_a (\eta + \rho) T \Delta' + \frac{k p C_0}{\varepsilon_a m}\Delta'\right) \\
        &\quad + \underbrace{\frac{\| W_0 - \refw\|_F^2 }{2 \eta T}}_{(D)} + \wt{O}(\eta \varepsilon_a^2 k m).
    \end{align*}
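
    The term $(D)$ arises because the distance terms cancel in pairs when summed over $t$.
    A sketch of this step, with expectations taken over the SAM sample path:
    \begin{equation*}
        \frac{1}{T} \sum_{t=0}^{T-1} \frac{\E \left[\| W_t - \refw\|_F^2 \right] - \E \left[ \| W_{t+1} - \refw\|_F^2 \right]}{2\eta} = \frac{\| W_0 - \refw\|_F^2 - \E \left[ \| W_T - \refw\|_F^2 \right]}{2\eta T} \leq \frac{\| W_0 - \refw\|_F^2}{2\eta T}.
    \end{equation*}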

    We can bound $(D)$ in the same way as \citet{allen2019learning}, 
    \begin{equation*}
        (D) = \frac{\| W_0 - \refw\|_F^2 }{2\eta T} = \wt{O}\left(\frac{k^2 p^2 \compsam(\phi, 1)^2}{\varepsilon_a^2 m} \cdot \frac{1}{\eta T} \right).
    \end{equation*}

    By setting $\eta = \wttheta(\frac{\varepsilon}{km\varepsilon_a^2}), \rho = \wttheta(\frac{\varepsilon}{km^3 \varepsilon_a^4}), T = \wttheta(k^3 p^2 \compsam(\phi, 1)^2 / \varepsilon^2)$, we have 
    $\Delta = \wt{O}(\frac{k^6 p^4 \compsam(\phi, 1)^4}{m^{3/2} \varepsilon^4})$ and
    $\Delta' = \wt{O}(\frac{1}{m^{3/2} \varepsilon} + \frac{\sqrt{k}}{m})$.
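
    Several of these orders can be checked by direct substitution.
    The sketch below treats $\varepsilon_a$ as $\varepsilon$ up to the polylogarithmic factors hidden by $\wt{O}(\cdot)$, so that $\eta = \wttheta(\frac{1}{\varepsilon k m})$ and $\rho = \wttheta(\frac{1}{\varepsilon^3 k m^3})$ as in the lemma statement.
    \begin{align*}
        (C) &= \wt{O}(\eta \varepsilon_a^2 k m) = \wt{O}\left( \frac{\varepsilon^2 k m}{\varepsilon k m} \right) = \wt{O}(\varepsilon), \\
        (D) &= \wt{O}\left( \frac{k^2 p^2 \compsam(\phi, 1)^2}{\varepsilon^2 m} \cdot \varepsilon k m \cdot \frac{\varepsilon^2}{k^3 p^2 \compsam(\phi, 1)^2} \right) = \wt{O}(\varepsilon), \\
        \Delta' &= \wt{O}\left( \frac{\varepsilon^2 k m^{3/2}}{\varepsilon^3 k m^3} + \frac{\varepsilon^4 k^2 m^{5/2}}{\varepsilon^6 k^2 m^6} + \frac{\varepsilon^3 k^{3/2} m^2}{\varepsilon^3 k m^3} \right) = \wt{O}\left( \frac{1}{\varepsilon m^{3/2}} + \frac{1}{\varepsilon^2 m^{7/2}} + \frac{\sqrt{k}}{m} \right).
    \end{align*}
    The middle term of $\Delta'$ is dominated by the first once $m^2 \geq 1/\varepsilon$, which recovers the expression for $\Delta'$ stated above.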
    Hence, for sufficiently large $m$, we obtain the following inequality, which, combined with the remaining arguments of \citet{allen2019learning}, proves \cref{lemma:sam-gen-optim}.
    \begin{equation}
        \frac{1}{T} \sum_{t=0}^{T-1} \E_{\text{SAM}} [L_G(\Z; W_t)] - L_G(\Z; \refw) \leq O(\varepsilon).
    \end{equation}
\end{proof}