\section{Experiment}
\begin{table*}
\vspace{-0.1cm}
\begin{centering}
\scalebox{0.81}{
\begin{tabular}{l | llll | llll l}
\toprule
\multirow{2}{*}{Method} & \multicolumn{4}{c|}{FM} & \multicolumn{4}{c}{DeepFM}\tabularnewline
\cline{2-9} \cline{3-9} \cline{4-9} \cline{5-9} \cline{6-9} \cline{7-9} \cline{8-9} \cline{9-9} 
 & Auc-8 $\uparrow$ & Logloss-8 $\downarrow$ & Auc-16 $\uparrow$ & Logloss-16 $\downarrow$ & Auc-8 $\uparrow$ & Logloss-8 $\downarrow$ & Auc-16 $\uparrow$ & Logloss-16 $\downarrow$\tabularnewline
\hline 
IU & $60.35\pm0.54$ & $16.74\pm0.25$ & $60.56\pm0.61$ & $16.78\pm0.16$ & $60.48\pm0.47$ & $15.63\pm0.19$ & $60.62\pm0.60$ & $15.69\pm0.12$\tabularnewline
\hline 
BU-2 & $62.69\pm0.50$ & $16.04\pm0.19$ & $62.31\pm0.73$ & $16.15\pm0.20$ & $62.65\pm0.37$ & $15.21\pm0.15$ & $62.40\pm0.48$ & $15.22\pm0.13$\tabularnewline
SPMF-2 & $61.56\pm0.43$ & $18.30\pm0.31$ & $61.41\pm0.75$ & $18.48\pm0.20$ & $61.12\pm0.57$ & $15.74\pm0.21$ & $60.64\pm0.90$ & $15.65\pm0.13$\tabularnewline
ASMG-2 & $63.82\pm0.42$ & $16.51\pm0.28$ & $63.80\pm0.49$ & $16.54\pm0.19$ & $63.95\pm0.42$ & $15.00\pm0.19$ & $63.85\pm0.54$ & $\pmb{14.96\pm0.13}$\tabularnewline
Meta-2 & $\pmb{65.23\pm0.46}${*} & $\pmb{15.81\pm0.27}${*} & $\pmb{64.84\pm0.61}${*} & $\pmb{15.89\pm0.20}${*} & $\pmb{65.04\pm0.42}${*} & $\pmb{14.93\pm0.16}${*}& $\pmb{64.60\pm0.57}${*} & $\pmb{14.96\pm0.13}$\tabularnewline
\hline 
BU-3 & $63.55\pm0.46$ & $15.28\pm0.15$ & $63.40\pm0.64$ & $15.30\pm0.13$ & $63.65\pm0.40$ & $14.93\pm0.14$ & $63.41\pm0.51$ & $14.90\pm0.11$\tabularnewline
SPMF-3 & $60.73\pm0.55$ & $18.18\pm0.35$ & $61.00\pm0.87$ & $18.32\pm0.23$ & $61.83\pm0.54$ & $14.99\pm0.16$ & $61.32\pm0.62$ & $14.74\pm0.12$\tabularnewline
ASMG-3 & $63.21\pm0.49$ & $18.51\pm0.41$ & $63.35\pm0.69$ & $19.61\pm0.27$ & $65.02\pm0.41$ & $14.82\pm0.17$ & $64.77\pm0.53$ & $14.80\pm0.11$\tabularnewline
Meta-3 & $\pmb{67.20\pm0.25}${*} & $\pmb{15.09\pm0.18}${*} & $\pmb{67.05\pm0.38}${*} & $\pmb{15.10\pm0.14}${*} & $\pmb{66.92\pm0.26}${*} & $\pmb{14.65\pm0.15}${*} & $\pmb{66.78\pm0.37}${*} & $\pmb{14.62\pm0.11}$\tabularnewline
\hline 
BU-5 & $66.19\pm0.24$ & $14.76\pm0.18$ & $66.24\pm0.30$ & $14.71\pm0.13$ & $66.15\pm0.23$ & $14.54\pm0.15$ & $66.23\pm0.29$ & $14.49\pm0.11$\tabularnewline
SPMF-5 & $61.96\pm0.44$ & $14.69\pm0.13$ & $62.21\pm0.53$ & $14.74\pm0.10$ & $63.79\pm0.41$ & $14.83\pm0.18$ & $62.79\pm0.48$ & $14.53\pm0.13$\tabularnewline
ASMG-5 & $65.82\pm0.32$ & $14.79\pm0.14$ & $65.99\pm0.40$ & $14.79\pm0.11$ & $66.49\pm0.26$ & $14.50\pm0.14$ & $66.47\pm0.35$ & $14.50\pm0.10$\tabularnewline
Meta-5 & $\pmb{69.00\pm0.21}${*} & $\pmb{14.62\pm0.13}$ & $\pmb{69.37\pm0.19}${*} & $\pmb{14.61\pm0.11}$ & $\pmb{68.85\pm0.33}${*} & $\pmb{14.39\pm0.23}$ & $\pmb{69.15\pm0.28}${*} & $\pmb{14.38\pm0.22}$\tabularnewline
\bottomrule
\end{tabular}
}
\par\end{centering}
\centering{}
\vspace{-0.1cm}
\caption{Summary of results on CriteoTB. AUC-$x$ and Logloss-$x$ denote the metrics computed on the examples of the last $x$ days. We report the mean and standard deviation over three random seeds. Meta-$b$ denotes the proposed FGD-$b$. Methods are compared under the same $b$, and the best result in each group is in bold. An asterisk (*) indicates that the best result is statistically significantly better than the second best under a matched-pair t-test (p-value less than 0.05).}\label{tbl:criteo}
\vspace{-0.3cm}
\end{table*}

We demonstrate the effectiveness of the proposed FGD.

\paragraph{Dataset.}
We consider two datasets, CriteoTB and Avazu. CriteoTB has 13 integer feature fields and 26
categorical feature fields, with around 800 million categorical tokens in total. It consists of 24 days of advertising data published by Criteo. Training on the original CriteoTB dataset is computationally expensive, so to reduce the computational overhead and improve reproducibility we use a subsampled version of CriteoTB in which 10\% of the examples are sampled for evaluation. Avazu contains 11 days of click/non-click data from Avazu, and all of its 22 feature fields are categorical. We preprocess both datasets following \citet{guo2017deepfm,liu2020learnable}.

\paragraph{Training Protocol.}
In real-world recommendation systems, passing over the training examples multiple times can cause severe over-fitting \citep{zheng2020shadowsync,ye2020adaptive,du2021alternate}. Following \citet{zheng2020shadowsync,ye2020adaptive}, we perform a single pass over the training data, in the sense that each training example is visited only once throughout training. Thus, we set $\th_{t}^0=\th_{t-b}$ when training the model at time $t$, because the examples from domains $\D_{s}$ with $s\le t-b$ have already been visited when learning $\th_{t-b}$. In Algorithm \ref{alg:main}, the default scheme trains the recommendation model until the norm of the gradient falls below a threshold; in the experiments, we instead train the model for a fixed number of iterations such that every example is passed exactly once.
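
As a rough sketch of the fixed-iteration strategy, suppose (this is our assumption for illustration, not a detail stated above) that the model trained at time $t$ draws mini-batches of size $B$ without replacement from the $b$ most recent domains. Starting from $\th_t^0=\th_{t-b}$, a single pass then corresponds to
\[
R_t=\left\lceil \frac{1}{B}\sum_{s=t-b+1}^{t}|\D_{s}|\right\rceil
\]
iterations, where $R_t$ and $B$ are symbols introduced only for this sketch. Under this reading, every example in $\D_{t-b+1},\ldots,\D_{t}$ is visited exactly once on the path from $\th_{t-b}$ to $\th_t$.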

\paragraph{Evaluation Protocol.}
Since we consider an online learning environment, there is no need to split the dataset into training and testing subsets. Instead, once $\th_t$ has been trained, the data of the next day, $\D_{t+1}$, is used to evaluate the performance of $f_{\th_t}$, so that the domain generalization error is measured. This evaluation protocol matches real recommendation systems \citep{ye2020adaptive}. We adopt AUC (Area Under the ROC Curve) and Logloss as performance measures. For CriteoTB, we evaluate the performance on the last 8 or 16 days, and the first 16 or 8 days, respectively, are used for offline training as a warm start. For Avazu, the first 3 days are used for offline training, and hence only the last 8 days are used for evaluation. The metrics are averaged over all the days used for evaluation. For all experimental settings, we run each compared approach 3 times with different random seeds and report the averaged results.
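
For concreteness, the reported quantities can be written as follows. The symbols $T$ (total number of days), $x$ (number of evaluation days), $y_j$ and $\hat{p}_j$ are introduced here only for illustration, and we assume that days are indexed $1,\ldots,T$. On a domain $\D$ with labels $y_j\in\{0,1\}$ and predicted click probabilities $\hat{p}_j$, the Logloss is the standard binary cross-entropy
\[
\mathrm{Logloss}(f_{\th};\D)=-\frac{1}{|\D|}\sum_{j\in\D}\bigl[y_j\log \hat{p}_j+(1-y_j)\log(1-\hat{p}_j)\bigr],
\]
and a metric $M\in\{\mathrm{AUC},\mathrm{Logloss}\}$ averaged over the last $x$ days is
\[
\bar{M}_x=\frac{1}{x}\sum_{t=T-x}^{T-1}M(f_{\th_t};\D_{t+1}),
\]
so that $\D_{T-x+1},\ldots,\D_{T}$ are the evaluation days. The tables appear to report both quantities scaled by 100.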

\paragraph{Models and Optimizers.}
We consider two representative architectures for recommendation models, FM \citep{rendle2010factorization} and DeepFM \citep{guo2017deepfm}. Following \citet{guo2017deepfm,liu2020learnable}, we use Adam as the optimizer with a batch size of 1024, and tune the learning rate of each compared method over $\{0.01, 0.001, 0.0001, 0.00001\}$ based on the performance of the offline training. For FGD, we add the current model on the training trajectory to the trajectory buffer every 150 iterations for CriteoTB and every 50 iterations for Avazu. The meta network is trained using SGD with a learning rate of 0.01 and a batch size of 20.

\paragraph{Baselines.}
For comparison, we consider the following optimization algorithms: Incremental Update (IU) \citep{wang2020practical}, which updates the model incrementally using only the newly observed data $\D_t$; Batch Update (BU-$b$) \citep{wang2020practical}, which updates the model using the most recent $b$ domains $\{\D_{t},\ldots,\D_{t+1-b}\}$; Stream-centered Probabilistic Matrix Factorization (SPMF-$b$) \citep{wang2018streaming}, in which a reservoir of historical examples is maintained and mixed with the new data when updating the current model, where SPMF-$b$ denotes the setting in which the example buffer has the same size as the number of examples in $b$ days; Adaptive Sequential Model Generation (ASMG-$b$) \citep{peng2021learning}, which generates a better serving model from the sequence of the $b$ most recent historical serving models via a meta generator; and Future Gradient Descent (FGD-$b$), our approach, with the $b$ most recent domains used for training the recommendation models.

\paragraph{Result.}
Tables \ref{tbl:criteo} and \ref{tbl:avazu} summarize the results for CriteoTB and Avazu, respectively. The proposed FGD outperforms the baselines in most cases. We also observe that increasing $b$ improves the performance of most algorithms, as more information can be utilized. FGD benefits the most from increasing $b$: for example, with FM on CriteoTB, its AUC-8 rises by $3.77$ points from $b=2$ to $b=5$, compared with $3.50$ for BU, $2.00$ for ASMG, and $0.40$ for SPMF.
% We think the reason is that FGD actively optimizes the way to integrate the historical domains so that it gives much smaller domain generalization error when more historical information are allowed.
The advantage of FGD is smaller on Avazu than on CriteoTB. We conjecture that the domains of different days differ less in Avazu than in CriteoTB.

\begin{table}
\centering{}%
\scalebox{0.73}{
\begin{tabular}{l| ll | ll }
\toprule 
\multirow{2}{*}{Method} & \multicolumn{2}{c|}{FM} & \multicolumn{2}{c}{DeepFM}\tabularnewline
\cline{2-5} \cline{3-5} \cline{4-5} \cline{5-5} 
 & Auc $\uparrow$ & Logloss $\downarrow$ & Auc $\uparrow$ & Logloss $\downarrow$\tabularnewline
\hline 
IU & $73.82\pm0.18$ & $39.92\pm0.86$ & $73.99\pm0.22$ & $39.80\pm0.81$\tabularnewline
\hline 
BU-2 & $74.16\pm0.25$ & $39.71\pm0.88$ & $74.31\pm0.21$ & $39.59\pm0.86$\tabularnewline
SPMF-2 & $69.31\pm0.31$ & $45.51\pm0.99$ & $71.11\pm0.53$ & $42.09\pm0.59$\tabularnewline
ASMG-2 & $\pmb{74.22\pm0.20}$ & $\pmb{39.66\pm0.89}$ & $\pmb{74.34\pm0.19}$ & $39.58\pm0.85$\tabularnewline
Meta-2 & $\pmb{74.22\pm0.28}$ & $39.77\pm0.90$ & $\pmb{74.34\pm0.21}$ & $\pmb{39.54\pm0.87}$\tabularnewline
\hline 
BU-3 & $74.17\pm0.31$ & $\pmb{39.68\pm0.89}$ & $74.50\pm0.30$ & $39.48\pm0.90$\tabularnewline
SPMF-3 & $68.95\pm0.56$ & $47.17\pm1.27$ & $71.93\pm0.24$ & $41.83\pm0.64$\tabularnewline
ASMG-3 & $73.64\pm0.08$ & $39.93\pm0.83$ & $73.95\pm0.17$ & $39.82\pm0.83$\tabularnewline
Meta-3 & $\pmb{74.20\pm0.27}${*} & $\pmb{39.68\pm0.89}$ & $\pmb{74.55\pm0.28}${*} & $\pmb{39.45\pm0.90}$\tabularnewline
\bottomrule 
\end{tabular}
}
\vspace{-0.1cm}
\caption{Summary of results on Avazu. The setting of the table is the same as that of Table \ref{tbl:criteo}.} \label{tbl:avazu}
\vspace{-0.1cm}
\end{table}

\paragraph{Temporal Domain Shift and Forecast Error of MFGG.}
\begin{figure}[t]
\begin{centering}
\includegraphics[scale=0.25]{fig/gnorm_modified.pdf}
\includegraphics[scale=0.25]{fig/gerror.pdf}
\par\end{centering}
\caption{Left: evolution of $\|\nabla r_t(\th_{t,i})\|^2$. Right: the normalized forecast error of MFGG at different times and iterations.} \label{fig:analysis}
\vspace{-0.5cm}
\end{figure}
To visualize the effect of the temporal domain shift, we plot the gradient norm over the whole training process. We consider FGD-3 on CriteoTB with DeepFM as the recommendation model. In this example, at each time $t$, the recommendation model is trained for $R=20K$ iterations. At time $t-1$, let $\th_{t,i}$ denote the parameter at the $i$-th training iteration (note that after training, $\th_{t}$ is used to predict the examples in $\D_{t}$). We visualize the evolution of the gradient norm on the future domain, $g_{t,i} = \|\nabla r_t(\th_{t,i})\|^2$, in chronological order (i.e., $\ldots, g_{t,1},\ldots,g_{t,R}, g_{t+1,1},\ldots,g_{t+1,R}, \ldots$) in the left subfigure of Fig.~\ref{fig:analysis}. Overall, $g$ decreases, suggesting improving performance, but significant fluctuations of $g$ are also observed: when we shift from $t$ to $t+1$, $g$ suddenly increases, demonstrating a considerable deviation between adjacent domains. We also visualize the (normalized) forecast error $e_{t,i}$ of MFGG in the right subfigure of Fig.~\ref{fig:analysis}, defined as
\[
e_{t,i}=\frac{\|m(\th_{t+1,i};\phi_{t},t)-\nabla r_{t+1}(\th_{t+1,i})\|^{2}}{\|\nabla r_{t+1}(\th_{t+1,i})\|^{2}}.
\]
Here, we normalize the error by the gradient norm $\|\nabla r_{t+1}(\th_{t+1,i})\|^{2}$ to rule out the effect of the decreasing gradient norm. We observe that the forecast error decreases, demonstrating that the gradient of the future domain can be predicted from the past domains. Moreover, the error remains stationary, which provides evidence that modeling the MFGG as a functional time-series model is reasonable.
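
A useful reference point for reading this normalization follows directly from the definition (it is not an additional experimental result). A trivial forecaster that always outputs the zero vector, $m\equiv 0$, attains
\[
e_{t,i}=\frac{\|0-\nabla r_{t+1}(\th_{t+1,i})\|^{2}}{\|\nabla r_{t+1}(\th_{t+1,i})\|^{2}}=1,
\]
so values of $e_{t,i}$ below $1$ mean that the MFGG forecast is closer to the true future gradient than predicting no gradient at all, while $e_{t,i}=0$ corresponds to a perfect forecast.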

\paragraph{Optimizing MFGG with Random Model.}
When optimizing MFGG, the loss is computed on a model $f_\th$ sampled from its training trajectory, so that MFGG focuses on accurately predicting the gradient at models $f_\th$ with reasonable performance. To show the importance of this design, we also run FGD with MFGG optimized using $f_\th$ with randomly initialized $\th$. We consider the setting of FGD-3 on CriteoTB, use both FM and DeepFM as the recommendation model, and summarize the results in Table \ref{tbl:rand}. Training the MFGG with randomly initialized recommendation models degrades the performance in most cases, although the differences are small.

\begin{table}
\begin{centering}
\scalebox{0.68}{
\begin{tabular}{l|l|llll}
\toprule 
Model & Buffer & Auc-8 $\uparrow$ & Logloss-8 $\downarrow$ & Auc-16 $\uparrow$ & Logloss-16 $\downarrow$\tabularnewline
\hline 
\multirow{2}{*}{FM} & Rand & $67.08\pm0.28$ & $15.17\pm0.21$ & $67.08\pm0.41$ & $15.28\pm0.16$\tabularnewline
 & Traj & \pmb{$67.20\pm0.25$} & \pmb{$15.09\pm0.18$} & $67.05\pm0.38$ & \pmb{$15.10\pm0.14$}\tabularnewline
\hline 
\multirow{2}{*}{DeepFM} & Rand & $66.83\pm0.27$ & $14.68\pm0.16$ & $66.68\pm0.41$ & $14.66\pm0.12$\tabularnewline
 & Traj & \pmb{$66.92\pm0.26$} & \pmb{$14.65\pm0.15$} & \pmb{$66.78\pm0.37$} & \pmb{$14.62\pm0.11$}\tabularnewline
\bottomrule 
\end{tabular}
}
\par\end{centering}
\centering{}
\vspace{-0.1cm}
\caption{Performance when the MFGG is trained with models sampled from the optimization trajectory (Traj) versus randomly initialized models (Rand). The setting of the table is the same as that of Table \ref{tbl:criteo}.} \label{tbl:rand}
\vspace{-0.4cm}
\end{table}

\paragraph{Computation Overhead.}
We compare the wall-clock training time of BU and FGD. We consider the DeepFM model on CriteoTB and report the average training time at each time $t$ for different $b$ in Table \ref{tbl:time}. The proposed FGD introduces only about 15\% overhead on average.
\begin{table}
\begin{centering}
\scalebox{0.83}{
\begin{tabular}{c|cc|cc|cc}
\toprule 
\multirow{2}{*}{Time (min)} & BU-2 & Meta-2 & BU-3 & Meta-3 & BU-5 & Meta-5\tabularnewline
\cline{2-7} \cline{3-7} \cline{4-7} \cline{5-7} \cline{6-7} \cline{7-7} 
 & 20.4 & 24.3 & 29.7 & 33.8 & 47.6 & 52.2\tabularnewline
\bottomrule 
\end{tabular}
}
\par\end{centering}
\centering{}
\vspace{-0.1cm}
\caption{Wall-clock training time (in minutes) of BU and FGD (denoted Meta) at each round $t$.}\label{tbl:time}
\vspace{-0.4cm}
\end{table}
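
The relative overhead can be read off Table \ref{tbl:time} directly. Taking the three (BU, FGD) column pairs from left to right, with times in minutes,
\[
\frac{24.3-20.4}{20.4}\approx 19.1\%,\qquad \frac{33.8-29.7}{29.7}\approx 13.8\%,\qquad \frac{52.2-47.6}{47.6}\approx 9.7\%,
\]
which averages to about $14\%$, in line with the roughly $15\%$ quoted above. The absolute extra time stays between $3.9$ and $4.6$ minutes, so the relative overhead shrinks as the BU training time grows.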