\section{Evaluation}\label{sec:evaluation}
\subsection{Experimental Setup}

\paragraph{Tasks.}
We evaluate \algadamw{} and \alglion{} on language modeling tasks, specifically supervised fine-tuning (SFT) and direct preference optimization (DPO)~\citep{rafailov2023direct}. Additional experiments, including pre-training and learning rate scheduler ablation, are presented in Appendix~\ref{sec:app-evaluation} due to space constraints. Table~\ref{tab:exp-summary} summarizes the experimental settings.


\begin{table}[h]
\footnotesize
\begin{tabular}{lll}
\hline
Dataset                           & Model          & Results \\ \hline
Alpaca                            & Phi-2-2.7B     & Tab.~\ref{tab:sft}    \\
DuReader\_Robust                  & Llama2-7B-base & Fig.~\ref{fig:llama}    \\\hdashline
\multirow{4}{*}{HH-RLHF-harmless} & Phi-2-2.7B     & Fig.~\ref{fig:dpo-accuracy}, \ref{fig:time-to-loss}, \ref{fig:hyper} \\
                                  & Qwen1.5-0.5B   & Fig.~\ref{fig:hyper}    \\
                                  & Qwen2-0.5B     & Fig.~\ref{fig:epoch}    \\
                                  & GPT2-medium    & Fig.~\ref{fig:var}     \\ \hline
\end{tabular}
\caption{Summary of experimental settings.}
\label{tab:exp-summary}
\end{table}

\paragraph{Baselines.}
We compare \algadamw{} and \alglion{} to AdamW and Lion, respectively. For each optimizer with period length $K$ and batch size $B$, we denote the settings as \algadamw-$K$ and AdamW-$K$ (similarly for Lion). AdamW-$K$ refers to AdamW with $K$-step gradient accumulation, resulting in effective batch size $KB$. All optimizers within each comparison group share identical hyperparameters. For the \algadamw{} and \alglion{} groups, we set $\text{lr}=2\text{e}{-6}$ and betas to $(0.9,0.95)$ and $(0.95,0.98)$, respectively, following~\cite{chen2023symbolic}.


\paragraph{Implementation.}
Experiments are conducted on a server equipped with $8\times$ NVIDIA A40 GPUs (each with $48$GB memory). We use the Swift framework~\citep{swift} and PyTorch~\citep{paszke2019pytorch}. SFT experiments use $K=8$, $B=32$; DPO tasks use $K=16$, $B=16$.
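
For reference, applying the accumulation convention from the baselines paragraph (effective batch size $KB$) to these stated settings gives the same effective batch size for both tasks:
\[
KB = 8 \times 32 = 256 \quad \text{(SFT)}, \qquad KB = 16 \times 16 = 256 \quad \text{(DPO)}.
\]
Runs reported with a different $K$ (for example, the $K=4$ rows in Table~\ref{tab:sft}) scale accordingly if $B$ is held fixed, which we assume here.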


\paragraph{Metrics.}
For SFT, we report validation loss and model performance on the MMLU benchmark~\citep{hendrycks2020measuring}. For DPO, we report validation loss and classification accuracy (distinguishing accepted from rejected responses). For classification, we regard a prediction as correct if the model assigns higher probability to the accepted response.
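
As a sketch of this accuracy metric, in notation introduced here only for illustration (the symbols $\mathcal{D}_{\text{val}}$, $x$, $y_w$, $y_l$, and $\pi_\theta$ are assumptions and are not defined elsewhere in this section), let each validation example consist of a prompt $x$ with an accepted response $y_w$ and a rejected response $y_l$. The reported accuracy can then be read as
\[
\text{Acc} = \frac{1}{|\mathcal{D}_{\text{val}}|} \sum_{(x,\,y_w,\,y_l)\in\mathcal{D}_{\text{val}}} \mathbf{1}\!\left[\pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x)\right],
\]
where $\pi_\theta$ denotes the model being trained and $\mathbf{1}[\cdot]$ is the indicator function. This form follows the verbal description above; an implementation might instead compare log-probability ratios against a reference model, which would change the quantity inside the indicator but not the structure of the metric.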


\paragraph{Methodology of Comparison.}
To compare optimizer efficiency, we adopt two metrics: (1) \emph{data efficiency}, measured by the amount of training data processed to reach a target metric value, accounting for batch size differences; and (2) \emph{FLOPs}, representing the computation required to reach a target metric value and reflecting training speed.
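
One possible formalization of these two metrics, in notation that is introduced here purely for illustration: let $\ell_A(n)$ denote the validation loss of optimizer $A$ after it has processed $n$ training samples, and let $\tau$ be a target loss. The data cost of reaching the target is
\[
N_A(\tau) = \min\{\, n : \ell_A(n) \le \tau \,\},
\]
and the compute cost $F_A(\tau)$ is defined analogously with cumulative FLOPs in place of $n$. Under this reading, optimizer $A$ is more data-efficient than optimizer $B$ at target $\tau$ when $N_A(\tau) < N_B(\tau)$, and the ratio $N_B(\tau)/N_A(\tau)$ quantifies the gain. For metrics where larger is better, such as accuracy, the inequality inside the set is reversed.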

\subsection{Supervised Fine-Tuning}

\begin{table}[h]
\centering
\resizebox{0.5\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
\multirow{2}{*}{\textbf{Algorithm}} & \multirow{2}{*}{\textbf{Val Loss}} & \multicolumn{5}{c}{\textbf{MMLU (Zero-Shot)}}                                                 \\ \cline{3-7} 
                                    &                                     & \rule{0pt}{2ex} \textbf{Hums.} & \textbf{STEM} & \textbf{Social} & \textbf{Other} & \textbf{Avg.} \\ \hline
AdamW-4                             & 0.9212                                    & \rule{0pt}{2ex}15.4           & 28.3          & 26.7            & 24.3           & 24.4          \\
AdamW-8                             & 0.9408                                    & \textbf{19.2}  & 22.8          & 26.7            & 25.0           & 23.3          \\
AdamW-PMA-4                         & 0.9352                                    & 16.9           & 22.8          & 25.0            & 22.7           & 21.9          \\
AdamW-PMA-8                         & \textbf{0.9078}                                    & 16.2           & \textbf{28.3} & \textbf{30.1}   & \textbf{35.0}  & \textbf{27.7} \\ \hdashline
Lion-4                              & 0.9227                                    & \rule{0pt}{2ex}13.1           & 23.3          & 24.2            & 25.7           & 21.8          \\
Lion-8                              & 0.9486                                    & \textbf{20.8}  & 22.2          & 24.2            & 25.0           & \textbf{23.0}          \\
Lion-PMA-4                          & \textbf{0.9136}                                    & 13.1           & \textbf{23.3} & \textbf{24.2}   & 25.7  & 21.8 \\
Lion-PMA-8                          & 0.9373                                    & 17.7           & 22.2          & 22.5            & \textbf{26.4}           & 22.3          \\ \bottomrule
\end{tabular}
}
\caption{Validation loss and zero-shot MMLU performance after one epoch of training for each algorithm. Due to limited space, we report only four representative MMLU categories.}
\label{tab:sft}
\end{table}

\begin{figure*}[t]
    \centering
    \begin{subfigure}[t]{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/dpo/dpo-accuracy/dpo_accuracy_flops_8.jpg}
        \caption{Validation accuracy vs.\ FLOPs ($K=8$)}
        \label{fig:dpo_accuracy_flops_8}
    \end{subfigure}\hfill
    \begin{subfigure}[t]{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/dpo/dpo-accuracy/dpo_accuracy_flops_16.jpg}
        \caption{Validation accuracy vs.\ FLOPs ($K=16$)}
        \label{fig:dpo_accuracy_flops_16}
    \end{subfigure}\hfill
    \begin{subfigure}[t]{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/dpo/dpo-accuracy/dpo_accuracy_numsofsamples_8.jpg}
        \caption{Validation accuracy vs.\ number of samples ($K=8$)}
        \label{fig:dpo_accuracy_numberofsamples_8}
    \end{subfigure}\hfill
    \begin{subfigure}[t]{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{fig/dpo/dpo-accuracy/dpo_accuracy_numsofsamples.jpg}
        \caption{Validation accuracy vs.\ number of samples ($K=16$)}
        \label{fig:dpo_accuracy_numberofsamples_16}
    \end{subfigure}\hfill
    \caption{Accuracy of distinguishing accepted from rejected responses on the validation set for the DPO task. Compared to AdamW and Lion, AdamW-PMA and Lion-PMA converge faster and reach higher accuracy.}
    % Red lines in \ref{fig:dpo_accuracy_flops_8}\ref{fig:dpo_accuracy_numberofsamples_8} are sparser because of the fewer evaluation times.}
    \label{fig:dpo-accuracy}
\end{figure*}

\begin{figure*}[t]
 \centering
 \begin{minipage}[t]{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig/rebuttal/epoch.jpg}  
    \caption{Validation loss when training for more epochs on the DPO task.}
    \label{fig:epoch}  
 \end{minipage}
 \hfill
    \nextfloat
 \begin{minipage}[t]{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig/rebuttal/dpo.jpg}  
    \caption{Runtime to reach the same loss on the DPO task.}
    \label{fig:time-to-loss}  
 \end{minipage}
 \hfill
    \nextfloat
 \begin{minipage}[t]{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig/combined_hyper.jpg}  
    \caption{Speedup factor of AdamW-PMA over AdamW for different values of $K$.}
    \label{fig:hyper}  
 \end{minipage}
 \end{figure*}

\begin{figure*}[t]
 \centering
 % \hfill
    % \nextfloat
 \begin{minipage}[t]{0.95\textwidth}
    \begin{minipage}[t]{0.48\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig/variance.png} % First image file
        \subcaption{Variance vs Training loss}
        \label{fig:var_train}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.48\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig/variance_time.png} % Second image file
        \subcaption{Variance vs Time}
        \label{fig:var_time}
    \end{minipage}
    \caption{Comparison of variance magnitude with respect to training loss and time. 
    % The vertical coordinates all use log scale since our algorithm is orders of magnitude different from other algorithms.
    }
    \label{fig:var}
 \end{minipage}
 % \hfill
 \end{figure*}


\begin{figure}[h]
    \centering
    \includegraphics[width=0.475\textwidth]{fig/rebuttal/llama.jpg} 
    \caption{Validation loss of SFT on Llama2-7B. AdamW-PMA consistently attains a lower loss than AdamW.}
    \label{fig:llama}  
\end{figure}

Table~\ref{tab:sft} shows that \algadamw{} and \alglion{} can improve SFT performance compared to their baseline counterparts. For validation loss, AdamW-PMA-8 achieves the lowest value among AdamW variants ($0.9078$), while Lion-PMA-4 is lowest in the Lion family ($0.9136$).

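For zero-shot MMLU, the clearest gain appears in the AdamW family at $K=8$: AdamW-PMA-8 obtains the highest scores in Social ($30.1$) and Other ($35.0$), ties the best STEM score ($28.3$), and reaches the highest average ($27.7$, versus $24.4$ for AdamW-4 and $23.3$ for AdamW-8). Humanities is the exception, where AdamW-8 and Lion-8 score highest; we attribute this to input length limitations on Phi-2. In the Lion family the averages are closer, ranging from $21.8$ to $23.0$.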

% Interestingly, Table~\ref{tab:sft} also reveals that algorithms scoring high in Humanities tasks tend to perform poorly in other categories. For instance, AdamW-8 achieves the highest score in Humanities (19.2) within the AdamW group but has one of the lowest overall average scores (23.3). This phenomenon is believed to be caused by the unbalanced SFT data, which lacks sufficient data in Humanities. Conversely, Lion-PMA-4, while maintaining a competitive score in Humanities (17.7), excels in other categories except for the averaged score. The low average score of Lion-PMA-4 is caused by the lack of data in Humanities, lowering the average score despite high scores in many other categories.




\subsection{Direct Preference Optimization}
\paragraph{PMA improves accuracy and reduces overfitting.}
Figure~\ref{fig:dpo-accuracy} compares validation accuracy as a function of FLOPs and of the number of training samples for $K=8$ and $K=16$. AdamW is consistently the slowest and least accurate, while PMA-based optimizers yield marked improvements in both convergence speed and final accuracy. With larger $K$, AdamW-PMA converges faster than AdamW and rivals Lion, while Lion-PMA delivers the strongest overall performance.

Figure~\ref{fig:epoch} presents validation loss across multiple epochs for Qwen2-0.5B on DPO. PMA-based optimizers sustain lower validation loss and are less prone to overfitting compared to EMA-based counterparts.

\paragraph{PMA reduces runtime.}
Figure~\ref{fig:time-to-loss} demonstrates that PMA-based optimizers reach a given loss target in less wall-clock time than EMA-based optimizers. Furthermore, PMA can utilize remaining data to achieve even higher accuracy once that target is met.

\paragraph{PMA is sensitive to $K$.}
We study the effect of $K$ on speedup in DPO experiments with both Phi-2 and Qwen1.5 models (see Fig.~\ref{fig:hyper}). For small $K$, PMA approximates standard AdamW, and variance reduction is minimal. For excessively large $K$, aggressive learning rate reduction can diminish acceleration. In practice, an intermediate $K$ achieves the best trade-off between speed and stability.






\subsection{Additional Properties}

\paragraph{PMA reduces update variance.}
We measure update variance using GPT-2 medium on the Alpaca dataset with $K=16$, comparing algorithms with identical configurations. Following prior work~\citep{ash2019deep,mirzasoleiman2020coresets,killamsetty2021glister}, we use the last-layer gradient as a proxy for model gradient variance. Figure~\ref{fig:var} shows that PMA yields consistently lower update variance than EMA, both at the same training loss and over training time.

\paragraph{PMA scales to large models.}
On Llama2-7B trained with SFT on DuReader\_Robust~\citep{tang2020dureader_robust}, AdamW-PMA consistently achieves lower validation loss than AdamW throughout training (see Fig.~\ref{fig:llama}), demonstrating effectiveness at scale.

\paragraph{PMA incurs minimal extra per-step overhead.}
Although PMA introduces more frequent updates and slightly more communication in distributed settings, per-iteration time increases are modest: on a 7B model, AdamW-PMA is only about $2\%$ slower per iteration than AdamW.
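
As a back-of-the-envelope reading of this overhead (the symbols $T$, $t$, and $r$ are introduced here only for illustration), suppose AdamW needs $T$ iterations of duration $t$ to reach a target, while AdamW-PMA needs $rT$ iterations of duration $1.02\,t$. The wall-clock ratio is then
\[
\frac{rT \cdot 1.02\,t}{T \cdot t} = 1.02\,r,
\]
so the per-iteration overhead is offset whenever $r < 1/1.02 \approx 0.98$, that is, whenever PMA saves more than roughly $2\%$ of the iterations needed to reach the target. This assumes the $2\%$ figure measured on the 7B model carries over to the setting of interest.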

