\section{Related Work}
\paragraph{Domain Generalization.}
Our problem can be viewed as an extension of the classic domain generalization
problem. The classic problem, extensively studied in vision and NLP, is \emph{one-shot} in the sense that it aims to generalize a model to a single unseen target domain by training over multiple source domains. In contrast, our problem is \emph{$T$-shot}, since
we have a stream of $T$ pairs of target/source domains. The difference between the one-shot and $T$-shot settings can be significant.
In the one-shot setting, we receive no feedback on how the model generalizes to the unseen domain. Existing algorithms therefore focus on improving worst-case generalization by learning domain-invariant representations, using methods such as domain feature alignment \citep{li2018domain,guo-etal-2019-towards}, causal learning \citep{arjovsky2019invariant,wang2022provable}, multi-task learning \citep{carlucci2019domain}, meta-learning \citep{balaji2018metareg,li2018learning} and data augmentation \citep{yan2020improve,ilse2021selecting}. In comparison, our algorithm mainly focuses on how to use the feedback available in the $T$-shot setting to learn to predict the gradient information of the future unseen domain. While adopting techniques from one-shot domain generalization is of interest, the design of those algorithms relies heavily on domain knowledge from CV or NLP, making them non-trivial to apply to recommendation systems. We thus leave this for future work.

\paragraph{Continual Learning.}
Continual learning is a similar scenario where the goal is to learn an accurate model given a stream of different tasks/domains. Compared with multi-task learning \citep{sener2018multi,crawshaw2020multi,ye2021pareto,wang2021bridging}, the key challenge of continual learning is \textit{catastrophic forgetting} \citep{kirkpatrick2017overcoming}: the model forgets how to solve past tasks after it is exposed to new tasks. Various types of solutions have been proposed, including rehearsal-based methods \citep{lopez2017gradient,aljundi2019gradient,chaudhry2020using}, knowledge distillation \citep{rebuffi2017icarl}, regularization \citep{kirkpatrick2017overcoming,buzzega2020dark} and architecture adjustment \citep{rusu2016progressive,serra2018overcoming}. Although the learning scenario is similar, a direct application of continual learning methods to our setting might not give a desirable outcome, because the final goals of the two problems are quite different: continual learning aims to learn the current task without sacrificing performance on previously learned tasks, while we focus only on performing well on the unobserved future task.

\paragraph{Gradual Domain Adaptation.} Gradual domain adaptation (GDA) aims to adapt a model to an unlabeled target domain after it has been trained on a labeled source domain and a sequence of unlabeled intermediate domains. Although similar to the setting of temporal domain generalization, GDA differs in that no labels are provided for the intermediate domains. A common modern approach to GDA is gradual self-training \citep{kumar2020understanding,wang2022understanding,zhou2022online,dong2022algorithms}, which fits a model to the source domain and then adapts it along the sequence of intermediate domains, one after another, with self-training \citep{nigam2000analyzing}.
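As a minimal sketch of one gradual self-training step, in notation that is ours and introduced only for illustration (it is not taken from the cited works), let $f_\theta$ denote a classifier with parameters $\theta$, let $S_t$ be the unlabeled sample from the $t$-th intermediate domain, and let $\ell$ be a classification loss. Assuming hard pseudo-labels, the model from the previous step labels $S_t$ and a new model is then fit to those labels:
\[
\hat{y}_t(x) = \arg\max_{c} \, \bigl[f_{\theta_{t-1}}(x)\bigr]_c ,
\qquad
\theta_t = \arg\min_{\theta} \frac{1}{|S_t|} \sum_{x \in S_t} \ell\bigl(f_\theta(x), \hat{y}_t(x)\bigr),
\]
where $\theta_0$ is obtained by supervised training on the labeled source domain. Individual methods may differ in details such as regularization or how pseudo-labels are filtered; the recursion above is only meant to convey the general shape of the procedure.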

\paragraph{Meta-Learning.} Meta-learning, or learning-to-learn, aims to optimize the training process itself so that the resulting model is improved. Examples of meta-learning include learning a better initialization \citep{finn2017model,lee2018gradient}, optimizer \citep{andrychowicz2016learning,flennerhag2019meta}, hyperparameters \citep{franceschi2018bilevel,chen2019lambdaopt} and network architecture \citep{liu2018darts,wang2022global}. The proposed FGD can be viewed as \textit{learning a better optimizer} for temporal domain generalization problems. Meta-learning is also widely deployed in recommendation systems. Examples include addressing the cold-start problem \citep{bharadhwaj2019meta,lee2019melu} by learning an initialization, and transferring knowledge through model fusion \citep{zhang2020retrain,peng2021learning}.

