\section{Methodology: Periodical Moving Average}\label{sec:method}

To address the dilemma identified in Section~\ref{sec:Dilemma}, we propose \algname{}, an enhancement of the exponential moving average (EMA) process that reduces gradient variance more effectively. Section~\ref{sec:method-idea} presents the high-level design and intuition underlying \algname{}, including its connections to and distinctions from existing work. Section~\ref{sec:method-detail} provides the detailed implementation, with an emphasis on the dynamics of $\beta$ and learning rate scheduling, while Section~\ref{sec:method-case} describes the integration of \algname{} with AdamW and Lion.

\subsection{High-Level Idea}\label{sec:method-idea}

\paragraph{Mimicking GA in Momentum Updates.}
At a high level, \algname{} simulates the variance reduction of gradient accumulation (GA) within momentum-based optimizers. Unlike standard EMA, which exponentially discounts past gradients, our method maintains uniform weighting for recent gradients within fixed periods. By partitioning training iterations into periods and employing a vanilla moving average for momentum updates within each period, \algname{} mimics the effect of GA, providing moment estimates of lower variance analogous to those obtained by EMA-based optimizers using GA (\S\ref{sec:moment-update-rule}).

\paragraph{From Pure Accumulation to Progressive Updates.}
Whereas standard GA does not update parameters until the end of each accumulation period, \algname{} interleaves updates within each period. Specifically, \algname{} alternates between steps with large and small learning rates: we designate the former as \textit{large update steps}, typically at the culmination of a period, and the latter as \textit{small update steps}. Each large update step, taken after $K$ small update steps, emulates the behavior of EMA-based optimizers with GA, while the intervening small update steps facilitate faster convergence. By judiciously choosing a reduced learning rate for these small steps, we seek to accelerate optimization without destabilizing the variance reduction effect (\S\ref{sec:learning-rate-strategy}).

\subsection{Detailed Design}\label{sec:method-detail}

We now describe the update rules, using the first moment (momentum) as an illustrative example.

\subsubsection{Momentum Update: Dynamics of $\beta$}\label{sec:moment-update-rule}

Unlike conventional EMA, where $\beta$ is fixed, \algname{} employs a dynamically adjusted $\beta$ to achieve a uniform moving average within each accumulation period.\footnote{Here, $\beta$ denotes the weight of the previous momentum; $1-\beta$ is the weight of the current gradient. For clarity, we retain the notation $\beta$ throughout.} Uniform weighting requires systematically adjusting $\beta$ during each period so that each historical gradient within the period contributes equally.

We describe momentum updates for both the large and small update steps. See Fig.~\ref{fig:beta} for a visualization.

\paragraph{Large Update Step: Low Gradient Weight for Variance Control.}
At the large update step ($\tau=0$), the momentum update is
\[
    m_t \gets \beta_1 m_{t-1} + (1-\beta_1) \frac{g_t}{K},
\]
where $K$ denotes the accumulation length. Unlike in EMA, the current gradient is scaled by $1/K$ and, after the update, the first and second moments are multiplied by $K$. This design both reduces the variance of the momentum estimates and ensures that, after $K$ steps, the accumulated gradients receive equal weights, consistent with the behavior of GA.

\paragraph{Small Update Steps: Uniform Moving Average via Dynamic Weights.}
For subsequent small steps within the period ($\tau=1,\ldots,K-1$), the momentum is updated as
\[
    m_t \gets \frac{\tau}{\tau+1} m_{t-1} + \frac{1-\beta_1}{\tau+1} g_t.
\]
This procedure, in essence, replaces the EMA with a vanilla moving average, ensuring that the gradients from all steps of the period contribute equally to $m_{t, K-1}$, where the subscript pair $(t,\tau)$ denotes step $\tau$ of period $t$. Notably, in $m_{t, K-1}$, the previous period's momentum $m_{t-1, K-1}$ receives weight $\beta_1$, and each $g_{t, \tau}$ (for all $\tau$) receives weight $\frac{1-\beta_1}{K}$, so that the weights sum to one and match the GA weighting scheme.
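
The equal-weighting property can be checked by unrolling the two update rules over one period. The following is a sketch under one assumption about ordering that the text does not spell out: the rescaling by $K$ is applied to the momentum right after the $\tau=0$ update and before the $\tau=1$ update. The shorthand $\bar m$ (momentum carried over from the previous period), $g_\tau$ (gradient at step $\tau$ of the current period), and $m^{(\tau)}$ (momentum after step $\tau$) is introduced here only for this illustration. After the rescaling,
\[
    m^{(0)} = K\beta_1 \bar m + (1-\beta_1)\, g_0 ,
\]
and applying the small-step rule repeatedly gives, by induction on $\tau$,
\[
    m^{(\tau)} = \frac{K\beta_1}{\tau+1}\, \bar m + \frac{1-\beta_1}{\tau+1} \sum_{s=0}^{\tau} g_s , \qquad \tau = 0,\ldots,K-1 .
\]
At the end of the period ($\tau = K-1$) this reduces to
\[
    m^{(K-1)} = \beta_1 \bar m + \frac{1-\beta_1}{K} \sum_{s=0}^{K-1} g_s ,
\]
which has the form of a single EMA update applied to the average of the $K$ gradients, as in GA.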

\input{contents-uai-cr/pseudo_code}
\begin{figure}[t]
\centering
    \begin{subfigure}[t]{0.23\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/lr/beta_schedule.png}
        \caption{Dynamic $\beta$ ($K = 8$, $\beta=0.9$)}
        \label{fig:beta}
    \end{subfigure}\hfill
    \begin{subfigure}[t]{0.23\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/lr/lr_schedule.png}
        \caption{Learning rate schedule ($K = 8$, $\gamma=1$)}
        \label{fig:lr}
    \end{subfigure}\hfill
    \caption{Illustrations of dynamic $\beta$ and learning rate scheduling.}
    \label{fig:lr-beta}
\end{figure}

\subsubsection{Learning Rate Schedule}\label{sec:learning-rate-strategy}

Complementing the momentum update, we design a learning rate schedule that distinguishes between small and large update steps. Specifically, small update steps use a learning rate that decays as $1/(\tau+1)$ rather than a fixed one. This schedule, shown in Fig.~\ref{fig:lr}, mitigates the risk of excessive parameter movement that could undermine the variance reduction intended by mimicking GA. For step $\tau$, the effective learning rate is given by
\[
    \eta_\tau = \frac{\eta}{\tau+1},
\]
where $\eta$ is the base learning rate. Because $\eta_\tau$ is inversely proportional to $\tau+1$, later small steps cause less drift from the virtual GA reference point, preserving the intended statistical behavior.
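
As a worked example of this schedule, take $K=8$, the value used in Fig.~\ref{fig:lr}. The per-step learning rates are then
\[
    (\eta_0, \eta_1, \ldots, \eta_7) = \left(\eta,\ \tfrac{\eta}{2},\ \tfrac{\eta}{3},\ \ldots,\ \tfrac{\eta}{8}\right),
\]
and their sum over one period is
\[
    \sum_{\tau=0}^{K-1} \eta_\tau = \eta\, H_K , \qquad H_8 = \tfrac{761}{280} \approx 2.72 ,
\]
where $H_K = \sum_{k=1}^{K} 1/k$ is the $K$-th harmonic number, a symbol introduced here only for illustration. For comparison, a constant learning rate $\eta$ over the same eight steps would sum to $8\eta$. This arithmetic uses only the formula for $\eta_\tau$ and ignores the moment rescaling of Section~\ref{sec:moment-update-rule}, so it should be read as a rough indication of the shape of the schedule rather than of the effective step size.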

\subsection{Case Study}\label{sec:method-case}
\subsubsection{From AdamW to \algadamw}
To obtain \algadamw{}, we replace the EMA in AdamW's first and second moment estimators with the periodical moving average scheme detailed above. The pseudocode is provided in Algorithm~\ref{alg:agma}. Both moments are rescaled at large update steps, while at small steps the learning rate is effectively scaled by $1/\sqrt{K}$, owing to the combination of momentum and learning rate scaling. All other components of AdamW are unchanged.

\subsubsection{From Lion to \alglion}
A similar modification applies to Lion: the vanilla moving average with dynamic scheduling replaces the EMA rule. Because Lion lacks a second moment estimate, the small-step learning rate is decayed by $1/K$, not $1/\sqrt{K}$ as in AdamW. This aligns with the rescaled momentum at large update steps, ensuring a decaying learning rate as described above. All other Lion components are unchanged. Pseudocode for \alglion{} is presented in Appendix~\ref{sec:app-pseudo-code}.

% End of section
