\section{Theoretical Analysis}\label{sec:theoretical-analysis}

This section provides theoretical guarantees for \algadamw{}: it attains a convergence rate of the same order as Adam's while incurring similar resource overhead.

\subsection{Convergence Analysis}
We analyze the convergence properties of \algadamw{} under the same assumptions as~\citet{kingma2014adam}. As a performance metric, we consider the \emph{regret}:
\begin{equation}\label{eq:regret-def}
    R_\tau(T) = \sum_{t=1}^T \left[ f(x_{t,\tau}) - f(x^*) \right],
\end{equation}
where $t$ indexes the periods and $\tau$ indexes the update steps within a period.
For a fixed $\tau$, this quantity measures the cumulative suboptimality of the iterates $x_{t,\tau}$ relative to the global optimum $x^*$ over $T$ periods.

For the analysis, we adopt the convention that the within-period index $\tau$ ranges from $1$ to $K$. The large update step corresponds to $\tau=K$ and moves from $x_{t,K}$ to $x_{t+1,1}$; the remaining indices $\tau \in [K-1]$, where $[n] = \{1,\dots,n\}$, correspond to the small update steps. We define $g_{t,\tau} = \nabla f(x_{t,\tau})$ and write $g_{t,\tau,i}$ for its $i$th component.
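
As an illustration of this indexing, consider the hypothetical choice $K=3$. Assuming that each small step $\tau \in [K-1]$ moves from $x_{t,\tau}$ to $x_{t,\tau+1}$ (an assumption consistent with, but not stated in, the convention above), one period of the iterate sequence reads
\[
    x_{t,1} \xrightarrow{\ \text{small},\ \tau=1\ } x_{t,2} \xrightarrow{\ \text{small},\ \tau=2\ } x_{t,3} \xrightarrow{\ \text{large},\ \tau=3\ } x_{t+1,1},
\]
so that $R_K(T) = R_3(T)$ sums the suboptimality of the iterates $x_{t,3}$ from which the large steps are taken.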

\begin{theorem}\label{thm:regret}
Suppose the objective $f$ is convex, with $\|\nabla f(x)\|_2 \leq G$ and $\|\nabla f(x)\|_\infty \leq G_\infty$ for all $x$. Assume that, for any $(t_1,\tau_1), (t_2,\tau_2) \in [T]\times[K]$, the parameter differences satisfy $\|x_{t_1,\tau_1} - x_{t_2,\tau_2}\|_2 \leq D$ and $\|x_{t_1,\tau_1} - x_{t_2,\tau_2}\|_\infty \leq D_\infty$. Further suppose the hyperparameters satisfy $\frac{\sqrt{1-\beta_2}}{1-\beta_1}\leq 1$. Then, for any $T\geq1$, \algadamw{} achieves:
\begin{equation}
\begin{aligned}
    R_K(T) \leq\, & \frac{\sqrt{K} D^2}{2\gamma (1-\beta_1)}  \sum_{i=1}^d \sqrt{T\, \hat{v}_{T,K,i}} \\
    &+ \frac{(1+\gamma) K^{3/2} G_\infty}{2(1-\beta_1)} \sum_{i=1}^d \|g_{1:KT,i}\|_2\\
    &+ \frac{D^2_\infty G_\infty (K-1)}{2(1-\beta_1)}.
\end{aligned}
\end{equation}
\end{theorem}

The proof of Theorem~\ref{thm:regret} is provided in Appendix~\ref{sec:proof}. For fixed $K$, the result shows that \algadamw{} attains a regret bound of $O(\sqrt{T})$, matching the order obtained for Adam~\citep{kingma2014adam}. Consequently, the average regret $R_K(T)/T$ vanishes as $T \to \infty$, which confirms the convergence guarantee in the convex setting.

Theorem~\ref{thm:regret} also shows that, for fixed $T$, the regret bound grows with $K$ (the number of small steps). This reflects a trade-off: a larger $K$ can speed up initial convergence, but an overly large $K$ may cause the update trajectory to deviate from that of Adam, potentially leading to larger regret in the long term.
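
To make the dependence on $T$ and $K$ in Theorem~\ref{thm:regret} explicit, the bound can be simplified under two additional assumptions that are not stated in this section: first, that $\hat{v}_{T,K,i} \leq G_\infty^2$, as holds when $\hat{v}$ is a bias-corrected exponential moving average of squared gradient coordinates, as in Adam; second, that $g_{1:KT,i}$ collects the $i$th gradient coordinate over all $KT$ steps, so that $\|g_{1:KT,i}\|_2 \leq G_\infty \sqrt{KT}$. Under these assumptions, a sketch of the resulting bound is
\[
\begin{aligned}
    R_K(T) \leq\, & \frac{d\, D^2 G_\infty \sqrt{K}}{2\gamma (1-\beta_1)} \sqrt{T}
    + \frac{(1+\gamma)\, d\, G_\infty^2 K^{2}}{2(1-\beta_1)} \sqrt{T} \\
    &+ \frac{D^2_\infty G_\infty (K-1)}{2(1-\beta_1)}.
\end{aligned}
\]
The first two terms are $O(\sqrt{T})$ and the third does not depend on $T$. In $K$, the three terms scale as $\sqrt{K}$, $K^2$, and $K-1$, respectively, so the simplified bound is increasing in $K$ for fixed $T$.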

% Uncomment if you wish to include the corollary on average regret:
% \begin{corollary}
% Under the same conditions as Theorem~\ref{thm:regret}, the average regret of \algadamw~ converges to zero at the rate
% \[
%     \frac{R_K(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right).
% \]
% \end{corollary}

\subsection{Resource Overhead Analysis}

We now discuss the computational and memory overhead of \algadamw{}. Since \algadamw{} is based on AdamW, we focus on a comparison with AdamW employing gradient accumulation.

\algadamw{} incurs no additional memory overhead compared to AdamW, with or without gradient accumulation. Its memory consists of storage for the model parameters, the gradients, and the running estimates of the first and second moments, exactly as in Adam and AdamW. When the same micro-batch size is used, the memory footprints of \algadamw{} and AdamW with gradient accumulation are therefore equivalent.
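
As a rough worked count of this footprint, let $d$ be the number of model parameters and assume, purely for illustration, that every stored quantity is kept in 32-bit floating point. The symbol $M_{\text{state}}$ is introduced here only for this example. The four stored vectors (parameters, gradients, first moment, second moment) then occupy
\[
    M_{\text{state}} = \underbrace{4}_{\text{vectors}} \times\, d \times \underbrace{4\ \text{bytes}}_{\text{32-bit float}} = 16\,d\ \text{bytes},
\]
for example $1.6 \times 10^{10}$ bytes (about 16\,GB) when $d = 10^9$. This count excludes activation memory, which depends on the micro-batch size, and it is the same for \algadamw{} and for AdamW with gradient accumulation.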

% If multi-GPU/distributed context is relevant, discuss:
% However, in distributed or multi-GPU settings, \algadamw{} may introduce additional communication overhead, especially for large $K$, due to the more frequent parameter synchronization necessitated by intra-period updates. Specifically, since \algadamw{} performs $K$ parameter updates for each period of $K$ micro-batches, the communication cost may scale linearly with $K$, potentially leading to reduced efficiency compared to AdamW with GA in highly parallelized environments.

