\section{Related Work}\label{sec:related-work}

\paragraph{First-Order Adaptive Optimizers.}
First-order adaptive methods form the backbone of modern deep learning optimization. AdaGrad~\citep{duchi2011adaptive} adapts the learning rate of each parameter using accumulated squared gradients, so that infrequently updated features receive larger steps. RMSProp~\citep{hinton2012neural} replaces AdaGrad's cumulative sum with an exponential moving average (EMA) of squared gradients. Adam~\citep{kingma2014adam} additionally maintains an EMA of the gradient itself (the first moment); together with AdamW~\citep{loshchilov2017decoupled}, its variant with decoupled weight decay, it is now the predominant optimizer for large-scale neural networks, especially Transformers~\citep{vaswani2017attention}. Numerous subsequent methods build on this foundation, including AdaFactor~\citep{shazeer2018adafactor}, Adam with Nesterov momentum~\citep{dozat2016incorporating}, AdaBelief~\citep{zhuang2020adabelief}, Adan~\citep{xie2022adan}, Lion~\citep{chen2023symbolic}, and GrAMS~\citep{cao2024grams}. Despite their effectiveness, these algorithms often incur high memory costs because they store additional first- and second-moment estimates, which poses significant challenges for training large models on memory-limited devices.
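
For concreteness, the update rules behind this lineage can be sketched as follows. The notation is our own shorthand for this illustration (stochastic gradient $g_t$, step size $\eta$, decay rates $\beta_1,\beta_2\in[0,1)$, small constant $\epsilon>0$, all operations element-wise) and need not match the cited papers exactly. AdaGrad accumulates squared gradients, whereas RMSProp replaces the sum with an EMA:
\[
v_t = v_{t-1} + g_t^2 \quad \mbox{(AdaGrad)}, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \quad \mbox{(RMSProp)},
\]
both followed by the step $\theta_{t+1} = \theta_t - \eta\, g_t / (\sqrt{v_t}+\epsilon)$. Adam keeps the RMSProp-style $v_t$, adds a first-moment EMA, and corrects the initialization bias of both:
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)}+\epsilon}.
\]
Since $m_t$ and $v_t$ each have the same shape as $\theta_t$, Adam holds two extra buffers per parameter, which is the memory overhead noted above.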

\paragraph{Memory-Efficient Optimizers.}
To address memory bottlenecks, several memory-efficient optimizers have been proposed. AdaFactor~\citep{shazeer2018adafactor} approximates the second-moment matrix using row and column factors. LOMO~\citep{lv2023full} fuses gradient computation with the parameter update to reduce transient storage. CAME~\citep{luo2023came} employs residual-based adaptive updating, and GaLore~\citep{zhao2024galore} applies low-rank gradient projections to save memory. Adam-mini~\citep{zhang2024adam} further trims learning rate-related state in Adam. However, most of these methods entail a trade-off: they either slow convergence or provide only moderate memory reduction (e.g., CAME saves $12.1\%$ over Adam, as reported by \citet{luo2023came}), which remains inadequate for training LLMs under stringent memory constraints. Thus, techniques such as gradient accumulation (GA) remain indispensable in memory-limited settings, which in turn motivates methods that accelerate training under GA.
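
To illustrate where the savings of a factored second moment come from, consider a weight matrix whose squared-gradient statistics form an $n\times m$ matrix $V$ (notation assumed here for illustration, not taken from the cited works). Instead of storing $V$, a rank-one factorization in the style of AdaFactor keeps only the row sums $R = V\mathbf{1}_m$ and the column sums $C = \mathbf{1}_n^\top V$, in practice as moving averages, and reconstructs
\[
\hat V = \frac{R\,C}{\mathbf{1}_n^\top R},
\]
so the second-moment state for this matrix shrinks from $nm$ to $n+m$ entries. For $n=m=4096$, this is $8{,}192$ entries instead of $16{,}777{,}216$. A first-moment buffer, when retained, is unaffected by this factorization, which is one reason the overall saving can be smaller than this count suggests.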

\paragraph{Variance Reduction.}
Variance reduction techniques for SGD are essential for accelerating convergence. Among these, dynamic sampling, gradient aggregation, and iterate averaging have received particular attention~\citep{bottou2018optimization}. SVRG~\citep{DBLP:conf/nips/Johnson013} and SAGA~\citep{DBLP:conf/nips/DefazioBL14} leverage historical gradients or parameter states to construct lower-variance stochastic gradient estimators, albeit at increased memory or computational cost. Iterate averaging~\citep{polyak1991} returns the average of the parameters across SGD iterates, while Nesterov's accelerated gradient methods~\citep{nesterov2013introductory} attain an $O(1/t^2)$ rate for smooth convex objectives in the deterministic setting. However, the practical application of these methods to LLMs is limited by their overhead in storage or per-iteration computation.
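
As a sketch of how a stored reference point lowers variance, consider a finite-sum objective $F(w)=\frac{1}{n}\sum_{i=1}^{n} f_i(w)$ (notation assumed here for illustration). SVRG periodically fixes a snapshot $\tilde w$, computes its full gradient $\tilde\mu=\nabla F(\tilde w)$, and then updates with a uniformly sampled index $i_t$:
\[
w_{t+1} = w_t - \eta\,\bigl(\nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde w) + \tilde\mu\bigr).
\]
The bracketed estimator is unbiased, because its expectation over $i_t$ equals $\nabla F(w_t)$, and for smooth $f_i$ its variance shrinks as $w_t$ and $\tilde w$ approach the optimum. The cost is an extra full-gradient pass per snapshot plus storage of $\tilde w$ and $\tilde\mu$; SAGA avoids the full pass but stores one gradient per sample. Iterate averaging instead returns $\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ and needs only one additional parameter-sized buffer.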

% \paragraph{Second-Order and Zeroth-Order Methods.}
% Second-order optimizers augment gradient steps with curvature information~\cite{broyden1970convergence, nesterov2006cubic}, although their application in deep learning often requires Hessian approximations to be tractable~\cite{martens2010deep, martens2015optimizing, ba2016distributed}. Recently, Sophia~\citep{liu2023sophia} and AdaHessian~\citep{yao2021adahessian} applied diagonal or low-rank second-order statistics to scale second-order optimization to larger models. Orthogonally, zeroth-order (gradient-free) optimizers have been adapted to fine-tune LMs in memory-efficient ways~\cite{malladi2023fine,gautam2024variance}, though their sample complexity and empirical performance still lag behind first-order methods, especially for nontrivial LLM workloads.

% The dominance of first-order adaptivity and practical memory constraints continue to spur innovation, including our Periodical Moving Average, which accelerates and stabilizes GA for LLM post-training on limited hardware.

