\section{Introduction}
% importance of rs
Web-scale recommendation systems are among the most important modern machine learning applications, providing personalized content to billions of users from inventories of billions of items. These recommendation models have grown rapidly in both computation and memory over the past few years, driven by wider and deeper networks and the use of sparse embedding layers.
% the sequential domain shift problem
% One of the key challenges in training online recommendation models is the temporal shift of data distribution (i.e., domain) caused by the drifting of user interest and changing of item popularity. Various works \citep{papagelis2005incremental,rendle2008online,rendle2009bpr,sculley2015hidden,devooght2015dynamic,ye2020adaptive,he2020lightgcn,zhang2020retrain,peng2021learning} 
Prior work \citep{ye2020adaptive,he2020lightgcn,zhang2020retrain,peng2021learning} has demonstrated the importance of updating the recommendation model periodically (e.g., on a daily or weekly basis) as new data arrives, so that the model does not become stale as the data distribution (i.e., the domain) shifts over time.

Designing such a periodic update pipeline is non-trivial: the algorithm must balance consolidating long-term memory (ensuring that useful past knowledge is preserved) against capturing short-term tendencies, which are valuable for near-future prediction \citep{zhang2020retrain,peng2021learning,deng2021deeplight}. Existing algorithms fall into two groups. (1) \emph{Sample-based} approaches maintain a reservoir of observed historical examples and reuse them to preserve long-term memory \citep{diaz2012real}; several heuristics have been developed to select past examples by balancing recency against forgetting \citep{chen2013terec,wang2018streaming,qiu2020gag,zhao2021stratified}. (2) \emph{Model-based} approaches maintain long-term memory by transferring knowledge between past and current model checkpoints via knowledge distillation \citep{wang2020practical,xu2020graphsail,mi2020ader} and model fusion \citep{zhang2020retrain,peng2021learning}.

% our perspective
In this paper, we provide a novel perspective by framing the problem of learning under shifting domains as a \emph{temporal domain generalization problem}. We observe that the crux lies in the mismatch between the distribution of the training examples and the distribution of the test examples on which the model is deployed for recommendation. From this perspective, existing approaches mitigate the distribution mismatch in an \emph{indirect} way: they train a robust model that is less vulnerable to near-future domain shift by making it capture both short-term and long-term signals. In contrast, we propose a more \emph{direct} solution to the temporal domain generalization problem, based on forecasting future information for training. Consider the ideal case in which we can access the data distribution of the near future, when the model is deployed. Simply training the model by gradient descent on examples drawn from that future data distribution would then be desirable (see `Ideal Update' in Fig.~\ref{fig:compare}). In the real world, where such future information is unavailable, we propose to train a meta future gradient generator that forecasts the gradient of the future examples, so that the recommendation model is trained as if we could look ahead into the future (i.e., `FGD Update' in Fig.~\ref{fig:compare}). In addition to the sample-based and model-based approaches, our method is \emph{optimizer-based} in that it improves the trainer of the recommendation model.
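
As an informal sketch of this contrast, the three updates can be written side by side. The notation here is assumed purely for illustration and is not the formal setup: $\theta_t$ denotes the model parameters after period $t$, $\ell_t$ the loss on the data of period $t$, $\eta$ a step size, and $\hat{g}_{t+1}$ the output of the meta future gradient generator. Each update is simplified to a single gradient step.
\[
\begin{array}{ll}
\mbox{Batch update:} & \theta_{t+1} = \theta_t - \eta \, \nabla \ell_t(\theta_t), \\[2pt]
\mbox{Ideal update:} & \theta_{t+1} = \theta_t - \eta \, \nabla \ell_{t+1}(\theta_t), \\[2pt]
\mbox{FGD update:} & \theta_{t+1} = \theta_t - \eta \, \hat{g}_{t+1}, \quad \hat{g}_{t+1} \approx \nabla \ell_{t+1}(\theta_t).
\end{array}
\]
Under these assumptions, the batch update follows the gradient of the most recent observed period, the ideal update follows the (unavailable) gradient of the upcoming period, and the FGD update replaces the latter with a forecast.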

On the theoretical side, we frame the problem as an online learning problem in which the temporal domain generalization error is captured by the gradient variation term \citep{chiang2012online,rakhlin2013online} in a local regret \citep{hazan2017efficient}. We provide a theoretical understanding of why the proposed algorithm improves over batch update, a widely used training pipeline \citep{wang2020practical}, and show that our method achieves regret similar to that of a fixed meta future gradient generator oracle. Empirically, we compare our approach against several representative sample-based and model-based approaches and observe considerable performance improvement.


\iffalse
Web-scale recommendation systems are among the most important modern applications, providing personalized content to billions of users from inventories of billions of items. These recommendation models have grown rapidly in both computation and memory in the past few years due to wider and deeper networks and the use of sparse embedding layers. Industrial recommendation systems typically deploy models up to hundreds of terabytes in size, with compute complexity on the order of giga-FLOPs. These large-scale models are very data-hungry and can consume hundreds of billions of data samples for increased model fidelity. Given the scale of the data, these systems require a significant amount of resources for training. Moreover, these models need to be retrained periodically to adapt to the changing data distribution, which can be attributed to the volatility of user interests and the constant influx of fresh content. This can lead to significant overhead if the model needs to be retrained from scratch every hour or every day. Hence, most industrial systems resort to incremental training, where the model is trained only on fresh incoming data, starting from previous checkpoints.

While in practice incremental training obviates the cost of retraining from scratch, it leads to new issues. First, the data demonstrates strong daily and weekly seasonality, and hence the model tends to have oscillatory learning patterns, which reduce the overall efficiency of the system. Second, due to the small volume of data in every incremental training step, the optimizer tends to overfit to the new data. This problem is typically exacerbated as model size increases. Third, the state-of-the-art optimizers for recommendation systems (e.g., Adagrad) are not capable of handling such distribution shifts, since in practice the optimizers gradually decrease the learning rate during training, which effectively attenuates the learning trajectory. Hence, we need to examine the optimization process of modern large models under gradual temporal domain shifts and propose a remedy for the optimization issues.
<Insert figure here that talks about the oscillatory nature of learning>

The core problem here is that the optimizer only looks at the last checkpoint and hence makes localized decisions using just the fresh data. A naive remedy would be to start from the previous checkpoint and retrain on a slightly larger window of data, allowing the optimizer to look further into the past. However, simply replaying past data sequentially is still sub-optimal because it does not break the seasonality of the data. In this work, we formulate this problem as a gradual domain shift problem in which every chunk of fresh data (e.g., an hour) is considered a new domain with gradual shift. We then propose a novel auto-regressive meta-learning network that learns the optimization trajectory holistically to predict the network parameters for the next period.

<list contributions>
\fi