\section{Preliminaries}
\subsection{Background: First-Order Optimization}

Adam~\citep{kingma2014adam,reddi2019convergence} and AdamW~\citep{loshchilov2017decoupled} are among the most widely used optimizers for large language model (LLM) training; both combine momentum with per-coordinate adaptive learning rates. Given an objective function $f : \mathbb{R}^d \to \mathbb{R}$, let $g_t$ denote the stochastic gradient at iteration $t$, i.e., a mini-batch estimate of $\nabla f(x_t)$. Adam maintains exponential moving averages (EMAs) of the first and second moments of the gradient: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, where $\beta_1, \beta_2 \in [0, 1)$ are hyperparameters, $m_0 = v_0 = 0$, and all operations are element-wise. These quantities are called EMAs because historical gradients receive exponentially decaying weights. To correct the bias toward zero introduced by the zero initialization, the moment estimates are debiased: $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$. A small constant $\epsilon > 0$ is added to the denominator of the update for numerical stability. The parameter update is then
\begin{equation}
    x_{t+1} \leftarrow x_{t} - \gamma \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\end{equation}
where $\gamma$ is the learning rate. AdamW decouples weight decay from the gradient-based update, applying the decay directly to the parameters rather than adding it to $g_t$. More recently, Lion~\citep{chen2023symbolic} has been proposed; it tracks only the first moment and updates the parameters using the sign of an EMA-based interpolation of gradients.
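
As a brief illustration of why the debiasing step has this form, the recursion for $m_t$ can be unrolled. This sketch assumes the zero initialization $m_0 = 0$ and, for the expectation only, that $\mathbb{E}[g_i] = \mu$ is the same at every step; the symbol $\mu$ and the stationarity assumption are introduced here purely for illustration:
\begin{equation}
    m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i,
    \qquad
    \mathbb{E}[m_t] = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i} \mu = (1-\beta_1^t)\, \mu .
\end{equation}
Under these assumptions, dividing by $1-\beta_1^t$ removes the bias toward zero, and the same argument applies to $v_t$ with $\beta_2$ and $g_i^2$. For reference, the decoupled update of AdamW and the sign-based update of Lion are commonly written as below, where the weight-decay coefficient $\lambda \ge 0$ and the interpolated direction $c_t$ are notation introduced here, Lion uses its own values of $\beta_1$ and $\beta_2$, and learning-rate schedules are omitted:
\begin{equation}
    x_{t+1} \leftarrow x_{t} - \gamma \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda x_t \right),
\end{equation}
\begin{equation}
    c_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,
    \qquad
    x_{t+1} \leftarrow x_{t} - \gamma \left( \mathrm{sign}(c_t) + \lambda x_t \right),
    \qquad
    m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t .
\end{equation}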

Despite their effectiveness, Adam-based methods struggle in high-variance settings, particularly under memory constraints. Training LLMs is intrinsically a high-variance optimization problem~\citep{mccandlish2018empirical}. To mitigate this variance, practitioners commonly increase the batch size using high-performance clusters~\citep{touvron2023llama1}. Conversely, reducing the batch size, as memory-limited environments often require, amplifies the stochasticity of gradient estimates, which slows convergence and degrades optimization performance~\citep{yuan2016influence,bottou2018optimization,kunstner2023noise,fu2023and}.
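
The dependence of gradient noise on batch size can be made explicit with a standard calculation. As a sketch, assume the mini-batch gradient averages $B$ per-sample gradients $\nabla f_i(x_t)$ that are drawn independently, are each unbiased for $\nabla f(x_t)$, and have variance at most $\sigma^2$; the symbols $B$, $f_i$, and $\sigma^2$ are introduced here for illustration only:
\begin{equation}
    g_t = \frac{1}{B} \sum_{i=1}^{B} \nabla f_i(x_t),
    \qquad
    \mathbb{E} \left\| g_t - \nabla f(x_t) \right\|^2
    = \frac{1}{B^2} \sum_{i=1}^{B} \mathbb{E} \left\| \nabla f_i(x_t) - \nabla f(x_t) \right\|^2
    \le \frac{\sigma^2}{B} .
\end{equation}
The cross terms vanish by independence and unbiasedness. Under these assumptions, halving the batch size doubles the variance bound, which is the effect described above.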

\subsection{Theoretical Solution: Variance Reduction in SGD}\label{sec:variance-reduction}

Here we review representative approaches for reducing gradient variance in stochastic optimization~\citep{bottou2018optimization}, alongside their limitations. \emph{Variance reduction} techniques typically reuse historical information to construct lower-variance gradient estimates. For instance, SVRG~\citep{DBLP:conf/nips/Johnson013} maintains a snapshot of the parameters $x_k$ (with $k < t$) and uses the gradient computed at that snapshot to correct each stochastic gradient. The iterate averaging method~\citep{polyak1991} averages the iterates across steps to yield a final estimate. More recent methods such as SARAH~\citep{nguyen2017sarah} and STORM~\citep{cutkosky2019momentum} adopt recursive update rules that avoid explicit storage of past gradients.
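
To make these mechanisms concrete, the corresponding gradient estimators can be sketched as follows. The sketch assumes a finite-sum objective $f = \frac{1}{n} \sum_{i=1}^{n} f_i$, writes $i_t$ (or $\xi_t$) for the sample drawn at step $t$, $\tilde{x}$ for the SVRG snapshot point, $u_t$ for the resulting estimator, and $a_t \in (0, 1]$ for the STORM momentum parameter. This notation is introduced here and follows the usual presentation of these methods in the cited works:
\begin{equation}
    u_t^{\mathrm{SVRG}} = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x}),
\end{equation}
\begin{equation}
    u_t^{\mathrm{SARAH}} = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_{t-1}) + u_{t-1}^{\mathrm{SARAH}},
\end{equation}
\begin{equation}
    u_t^{\mathrm{STORM}} = \nabla f(x_t; \xi_t) + (1 - a_t) \left( u_{t-1}^{\mathrm{STORM}} - \nabla f(x_{t-1}; \xi_t) \right).
\end{equation}
SVRG requires the full gradient $\nabla f(\tilde{x})$ at every snapshot, whereas SARAH and STORM evaluate the same sample at two points, $x_t$ and $x_{t-1}$, in every step.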

% \paragraph{Drawbacks.}
However, these variance reduction methods exhibit practical deficiencies in the LLM context. Approaches like SAGA incur prohibitive memory costs because they store a gradient for every data sample, so memory usage scales with dataset size. Iterate averaging demands storing all historical parameter vectors, incurring memory overhead proportional to the number of steps. SVRG relies on large-batch gradient computations at each snapshot, increasing memory requirements. Although SARAH and STORM reduce storage needs by not retaining past gradients explicitly, they require gradient evaluations at both the current and the previous iterate, and hence multiple backpropagation passes, for every parameter update, which substantially increases the computational cost of LLM training.

\subsection{Practical Solution: Gradient Accumulation}\label{sec:ga-dp}

To address high gradient variance under memory constraints, \emph{gradient accumulation} (GA)\footnote{While memory-efficient optimizers such as those proposed in~\citep{shazeer2018adafactor,luo2023came,zhao2024galore,zhang2024adam} are also viable, we argue that GA remains more practical for our scenario. A detailed discussion appears in \S~\ref{sec:related-work}.} provides a straightforward solution. GA divides a large batch into $K$ smaller micro-batches that are processed sequentially, so that no single forward and backward pass exceeds the device memory limit. Once the gradients of all $K$ micro-batches have been accumulated, they are averaged to form the gradient over the full batch, and the optimizer performs a single parameter update. Notably, because every micro-batch gradient is computed at the same parameters, the gradient accumulated via GA is mathematically equivalent to that obtained from a large batch, provided the micro-batches are of equal size.
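
The equivalence can be checked directly. As a sketch, assume the large batch $\mathcal{B}$ is split into $K$ disjoint micro-batches $\mathcal{B}_1, \dots, \mathcal{B}_K$ of equal size $b$, that all micro-batch gradients are evaluated at the same parameters $x_t$, and that the loss is a plain average of per-sample losses $f_i$ (no batch-dependent layers such as batch normalization). The symbols $\mathcal{B}$, $\mathcal{B}_k$, $b$, and $f_i$ are introduced here for illustration:
\begin{equation}
    \frac{1}{K} \sum_{k=1}^{K} \left( \frac{1}{b} \sum_{i \in \mathcal{B}_k} \nabla f_i(x_t) \right)
    = \frac{1}{Kb} \sum_{i \in \mathcal{B}} \nabla f_i(x_t) .
\end{equation}
If the micro-batches differ in size, the outer average would instead need to weight each micro-batch gradient by $|\mathcal{B}_k| / |\mathcal{B}|$ for the identity to hold.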

Despite its statistical soundness, GA has a substantial practical drawback: increased wall-clock time. GA achieves memory efficiency by trading parallelism for serial computation: on resource-limited devices, each parameter update requires $K$ successive forward and backward passes, which is slower than processing the large batch in parallel. Recent variants such as that of~\citet{pham2023combined} further reduce memory usage in GA, but do not alleviate the increased training time on memory-constrained hardware.
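
A rough cost model illustrates this trade-off. Assume, hypothetically, that one forward and backward pass over a micro-batch takes time $\tau$ and that the costs of the optimizer step and of any communication are negligible; the symbols $\tau$, $T_{\mathrm{GA}}$, and $T_{\mathrm{par}}$ are illustrative and not taken from any measurement:
\begin{equation}
    T_{\mathrm{GA}} \approx K \tau,
    \qquad
    T_{\mathrm{par}} \approx \tau ,
\end{equation}
where $T_{\mathrm{GA}}$ is the time per parameter update with GA on a single device and $T_{\mathrm{par}}$ is the time per update when the $K$ micro-batches are processed concurrently. Under this simplified model, GA with $K = 8$ takes roughly eight times longer per update than the parallel baseline.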


% Another method to utilize a large batch size on memory-limited GPUs is the Distributed Data Parallel (DDP) method. The procedure of DP is as follows: the model parameters are copied onto multiple GPUs, and the large training batch is divided into smaller batches, each assigned to a different GPU. Each GPU independently performs feed-forward and back-propagation for several rounds of GA. Before updating the parameters, the GPUs synchronize the accumulated gradients by communicating with each other and computing the mean of the gradients~\cite{gibiansky2017}. Thus, the GA process with DDP does not need more communication overhead, and the communication only happens when a parameter update is required.

% \benyou{why do we need to propose a new approach in Sec. 3}
% \yumou{Are we going to restate the technical challenge here? Since the challenges have been stated in the Introduction, is it necessary to repeat them? Why or why not?}
% \response{Response: }

\subsection{The Dilemma}
\label{sec:Dilemma}

The discussions in Sections~\ref{sec:variance-reduction} and~\ref{sec:ga-dp} reveal a fundamental dilemma between theoretically grounded and practically feasible approaches to high-variance optimization under GPU memory constraints. On the one hand, established variance reduction methods incur prohibitive memory or computational costs, rendering them unsuitable for resource-limited environments. On the other hand, practical strategies either converge slowly because of elevated gradient variance, as with small-batch training, or require substantially more training time, as with the sequential computations inherent to GA. Crucially, this dilemma stems from the persistence of high variance in stochastic gradients. There is therefore a pressing need for methods that reduce the variance of parameter updates without incurring additional memory or computational overhead.
