\section{Additional Theoretical Analysis}

In this section, we provide a theoretical analysis of the convergence properties of \algadamw. Specifically, we focus on convergence with respect to the number of large update steps, because the time cost between two large steps is approximately equal to the time between two updates of Adam with GA. For ease of analysis, we slightly modify the notation. Unlike Algorithm~\ref{alg:agma}, where the index of small update steps ranges from $0$ to $K-1$, in the subsequent analysis this index ranges from $1$ to $K$. Specifically, when $\tau=K$, the update step from $x_{t,K}$ to $x_{t+1,1}$ is considered a large update step for all $t$. For every other $\tau\in[K-1]$, the update step from $x_{t,\tau}$ to $x_{t,\tau+1}$ is a small step.
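
As an illustration of this indexing, take $K=3$ (a value chosen purely as an example). The updates within round $t$ then read
\begin{equation*}
    x_{t,1}\stackrel{\mathrm{small}}{\longrightarrow} x_{t,2}\stackrel{\mathrm{small}}{\longrightarrow} x_{t,3}\stackrel{\mathrm{large}}{\longrightarrow} x_{t+1,1},
\end{equation*}
so each round consists of $K-1=2$ small steps followed by one large step, and consecutive large steps are compared through the iterates $x_{t,1}$ and $x_{t+1,1}$.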

First, based on Theorem~\ref{thm:regret}, we show that the average regret of \algadamw~converges.
\begin{corollary}
Assume that the optimization objective $f$ is convex and has bounded gradients, $\|\nabla f(x)\|_2\leq G$ and $\|\nabla f(x)\|_{\infty}\leq G_{\infty}$; that the distance between any two parameters generated by \algadamw~is bounded, $\|x_{t_1,\tau_1}-x_{t_2,\tau_2}\|_2\leq D$ and $\|x_{t_1,\tau_1}-x_{t_2,\tau_2}\|_{\infty}\leq D_{\infty}$ for any $t_1,t_2\in [T]$ and $\tau_1,\tau_2\in [K]$; and that $\beta_1$ and $\beta_2$ satisfy $\frac{\sqrt{1-\beta_2}}{1-\beta_1}\leq 1$. Then \algadamw~achieves the following regret guarantee for all $T\geq 1$:
\begin{equation*}
    \frac{R_K(T)}{T}=O\left(\frac{1}{\sqrt{T}}\right).
\end{equation*}
\end{corollary}
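
The step from a cumulative regret bound to this rate is a division by $T$. As a sketch only, suppose that Theorem~\ref{thm:regret} yields a bound of the form $R_K(T)\leq C\sqrt{T}$, where $C$ is a hypothetical constant (our notation, assumed independent of $T$) collecting the dependence on the remaining quantities. Under this assumption,
\begin{equation*}
    \frac{R_K(T)}{T}\leq \frac{C\sqrt{T}}{T}=\frac{C}{\sqrt{T}}\longrightarrow 0 \quad (T\to\infty),
\end{equation*}
which is the stated $O\left(1/\sqrt{T}\right)$ rate.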

Next, we bound the update size between two large update steps in the general non-convex setting.
\begin{theorem}\label{thm:large-step-size}
Assume that the objective function $f$ is $L$-smooth. Then the step size between two large update steps is bounded by
\begin{equation}\label{eq:large-step-size}
    % f(x_{t,1})-f(x^*)\leq \frac{LB}{2} + \left(1+\frac{\gamma^2L}{K} \frac{(1-\beta_1)^2}{1-\beta_2}(K+1)\right)\cdot \bar{\zeta}\frac{1-(2a)^{2t-2}}{1-4a^2}.
    % \|x_{t+1,1}-x_{t.1}\|^2\leq \frac{2}{L} \left(1+\frac{\gamma^2L}{K} \frac{(1-\beta_1)^2}{1-\beta_2}(K+1)\right)\cdot \bar{\zeta}(2a)^{2t-2},
    \|x_{t+1,1}-x_{t,1}\|^2\leq \frac{2}{L} \left(1+\frac{\gamma^2L}{\sqrt{K}} \frac{(1-\beta_1)^2}{1-\beta_2}(K+1)\right)\cdot \bar{\zeta}(2a)^{2t-2},
\end{equation}
where $\bar{\zeta}$ and $a$ are constants satisfying $\bar{\zeta}(2a)^{2t-2}\geq \zeta(2a)^{2t-2}+\frac{c}{1-4a^2} + K^2$ and $a=\frac{\beta_1(1-\beta_1)}{\sqrt{\beta_2(1-\beta_2)}}\cdot \frac{1}{\sqrt{K}}$.
\end{theorem}

Theorem~\ref{thm:large-step-size} indicates that the distance between two large update steps is bounded and, when $2a<1$, converges to $0$ as $t$ grows. Despite having $K$ small updates with varying momentum averaging weights, the step sizes still converge rapidly, which supports setting the learning rate of the small steps to $\gamma/\sqrt{K}$. Furthermore, since $a\propto 1/\sqrt{K}$, the exponential term $(2a)^{2t-2}$ decreases with $K$, aligning with the intuition that more small update steps lead to faster convergence.
% . Moreover, Equation~\ref{eq:large-step-size} reveals that the constant term before the exponential term of the step size is about $O(\sqrt{K})$
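
To make the decay of the exponential term concrete, the following worked calculation is a sketch under our own assumptions: it uses the common Adam defaults $\beta_1=0.9$ and $\beta_2=0.999$, which are illustrative values and not settings taken from the analysis above. From the definition of $a$ in Theorem~\ref{thm:large-step-size},
\begin{equation*}
    4a^2=\frac{4\beta_1^2(1-\beta_1)^2}{\beta_2(1-\beta_2)}\cdot\frac{1}{K},
    \qquad
    2a<1 \;\Longleftrightarrow\; K>\frac{4\beta_1^2(1-\beta_1)^2}{\beta_2(1-\beta_2)}.
\end{equation*}
With the assumed values, the threshold evaluates to
\begin{equation*}
    \frac{4\cdot 0.81\cdot 0.01}{0.999\cdot 0.001}=\frac{0.0324}{0.000999}\approx 32.4,
\end{equation*}
so the factor $(2a)^{2t-2}$ decays in $t$ for $K\geq 33$. For instance, $K=64$ gives $4a^2\approx 0.51$, and a larger $K$ shrinks this ratio further. The same values also satisfy the condition $\frac{\sqrt{1-\beta_2}}{1-\beta_1}\approx 0.32\leq 1$ used in the corollary.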