
\section{Introduction}

Scaling large language models (LLMs) has consistently improved model capability and generalization~\citep{radford2019language,kaplan2020scaling,brown2020language,hoffmann2022training,zhang2022opt,touvron2023llama1,touvron2023llama,achiam2023gpt,bi2024deepseek}. Post-training stages, including supervised fine-tuning (SFT) and reinforcement-learning-based training~\citep{openaio1}, impose substantial computational and memory costs and typically require multi-GPU clusters~\citep{Lee_2022}. Scaling further raises GPU memory demands, which makes training or adapting these models on memory-limited devices difficult.

Existing strategies for training LLMs under memory constraints are slow. Simply reducing the batch size increases gradient variance, which slows convergence. Alternatively, gradient accumulation (GA) alleviates the memory bottleneck by splitting a large batch into several micro-batches and accumulating their gradients before a single parameter update. While GA emulates large-batch training, it replaces parallel computation with sequential computation and therefore lengthens training time.
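
To make this trade-off concrete (the notation is assumed here for illustration), suppose a batch is split into $N$ micro-batches with gradients $g^{(1)}, \dots, g^{(N)}$. GA applies a single update with the averaged gradient
\[
    \bar{g} = \frac{1}{N} \sum_{i=1}^{N} g^{(i)}.
\]
If the micro-batch gradients are independent with common per-coordinate variance $\sigma^2$, then each coordinate of $\bar{g}$ has variance $\sigma^2 / N$, but every parameter update now costs $N$ sequential forward and backward passes.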

\begin{figure*}[t]
    \centering
    \begin{subfigure}{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/introduction/introduction_sft_AdamW_8_flops.PNG}
        \caption{AdamW vs. \texttt{AdamW-PMA} on SFT.}
        \label{fig:intro-sft-loss-adamw}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/introduction/introduction_sft_Lion_8_flops.PNG}
        \caption{Lion vs. \texttt{Lion-PMA} on SFT.}
        \label{fig:intro-sft-loss-lion}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/introduction/introduction_dpo_AdamW_16_flops.PNG}
        \caption{AdamW vs. \texttt{AdamW-PMA} on DPO.}
        \label{fig:intro-dpo-loss-adamw}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/introduction/introdution_dpo_Lion_16_flops.PNG}
        \caption{Lion vs. \texttt{Lion-PMA} on DPO.}
        \label{fig:intro-dpo-loss-lion}
    \end{subfigure}
    \caption{Optimizers with \algname{} achieve approximately $2\times$ speedup compared to EMA-based optimizers. (\ref{fig:intro-sft-loss-adamw}, \ref{fig:intro-sft-loss-lion})~show SFT validation loss on Phi-2~2.7B with Alpaca; (\ref{fig:intro-dpo-loss-adamw}, \ref{fig:intro-dpo-loss-lion})~show DPO validation loss on Phi-2~2.7B with HH-RLHF.}
    \label{fig:intro}
\end{figure*}

In this work, we introduce the \emph{Periodical Moving Average} (\algname{}), a momentum update scheme that accelerates momentum-based optimizers for LLM post-training on memory-limited hardware by reducing gradient variance without adding computational overhead.

Existing approaches involve a fundamental trade-off: GA cannot interleave parameter updates within accumulation periods without losing memory efficiency, while small-batch training suffers from high variance and unstable parameter updates. Our central insight is that, during post-training, LLM parameters evolve gradually due to low learning rates, leading to consecutive gradients with highly similar expectations. This motivates rethinking the exponential moving average (EMA): while EMA discounts historical gradients exponentially and thus rapidly forgets useful information, a properly designed moving average can better utilize recent history for stabilization.

\algname{} partitions training into discrete periods of $K$ steps each. Within each period, it replaces EMA with a simple moving average in the momentum update, uniformly weighting the gradients of that period to reduce variance. Between periods, it reverts to standard EMA, preserving the rapid convergence of established optimizers. This hybrid mechanism balances stability against efficiency.
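
As an informal sketch of this scheme (the notation here is introduced for illustration only and is an assumption, not the formal definition of \algname{}), let $g_t$ denote the stochastic gradient at step $t$, $m_t$ the momentum, and $\beta \in [0, 1)$ the EMA coefficient. Standard EMA updates $m_t = \beta m_{t-1} + (1 - \beta) g_t$ at every step. One possible reading of the periodic scheme, for the $k$-th step ($1 \le k \le K$) of a period that begins after step $t_0$, is
\[
    m_{t_0 + k} = \beta\, m_{t_0} + (1 - \beta)\, \frac{1}{k} \sum_{j=1}^{k} g_{t_0 + j}.
\]
Under this reading, gradients inside the period are weighted uniformly, the momentum carried over from the previous period is discounted by $\beta$ as in EMA, and at $k = K$ the update coincides with one EMA step on the gradient averaged over the whole period.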

A notable challenge with \algname{} is \emph{trajectory deviation}, wherein uniform averaging within a period may cause parameter updates to diverge from the ideal optimization path, due to lack of access to the true gradient expectation. To mitigate this, we employ a \emph{periodic learning rate decay}: the learning rate linearly decreases within each period and resets at period boundaries, ensuring that parameter updates remain stable and aligned with the underlying optimization trajectory.
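
For illustration, one schedule of this kind (a sketch under assumed notation; the exact slope and the value reached at the end of a period are assumptions) with base learning rate $\eta_0$, period length $K$, and step index $t = 1, 2, \dots$ is
\[
    \eta_t = \eta_0 \left( 1 - \frac{(t - 1) \bmod K}{K} \right),
\]
which equals $\eta_0$ at the first step of every period and decreases linearly to $\eta_0 / K$ at the last step before resetting.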

We instantiate \algname{} within AdamW~\citep{loshchilov2017decoupled} and Lion~\citep{chen2023symbolic}, resulting in \algadamw{} and \alglion{}, respectively. We evaluate these variants on SFT and Direct Preference Optimization (DPO)~\citep{rafailov2023direct} across a range of models, including GPT-2~\citep{radford2019language}, Phi-2~\citep{javaheripi2phi}, Qwen1.5~\citep{qwen1.5}, Qwen2~\citep{yang2024qwen2}, and Llama2~\citep{touvron2023llama}. Our experiments show that \algadamw{} and \alglion{} achieve approximately $2\times$ speedup over their EMA-based counterparts while attaining better downstream task performance. We further provide a theoretical analysis of the learning rate strategy and regret bound of \algadamw{}, showing that its convergence properties are on par with those of Adam.

\textbf{Our key contributions are:}
\begin{itemize}
    \item We propose Periodical Moving Average (\algname{}), a momentum update mechanism that accelerates large-batch emulation for LLM post-training on memory-limited devices. When applied to AdamW and Lion (\algadamw{} and \alglion{}), our method stabilizes training and reduces training time and data requirements, without incurring extra per-step memory or computation.
    \item We provide comprehensive empirical validation across models ranging from 0.1B to 7B parameters in both SFT and DPO settings. \algname{}-based optimizers achieve approximately $2\times$ speedup over GA and deliver better downstream performance than the baselines.\footnote{Implementation available at \url{https://github.com/liuyumou/periodical-moving-average.git}}
    \item We theoretically analyze the convergence of \algadamw{}, supporting the periodic learning rate decay strategy and establishing a regret bound on par with that of standard Adam.
\end{itemize}
