


\section{Method}\label{sec:method}

Recall that our goal is to learn $f_{\th_{t}}$
that gives accurate predictions on $\D_{t}$ (i.e., achieves small $r_{t}$).
In an ideal world where the near-future data $\D_{t}$ were accessible
while training $f_{\th_{t}}$, a simple
yet promising approach would be to apply gradient descent using $\nabla r_{t}$
(see the first plot in Fig.~\ref{fig:compare}). In the realistic case where the future information $\nabla r_{t}$
is not available, we propose to learn a meta future gradient generator (MFGG) that forecasts $\nabla r_{t}$ given the observed data $\cup_{i=1}^{t-1}\D_{i}$; see the third plot in Fig.~\ref{fig:compare}.
\iffalse
We are mainly interested in recommendation systems equipped with deep neural networks, which are highly non-convex. In order to learn a model that generalizes well in the unseen test domain $\D_{t}$, we are looking for a model $f_{\th_{t}}$ at time step $t$ that minimizes $\|\nabla r_{t}(\th_{t})\|$. This goal is easy to achieve if we know $\D_{t}$ at the training time of $f_{\th_t}$ by simply applying gradient descent on loss $r_{t}$. In the case that $\D_{t}$ is unobserved at training time, we propose to learn a meta network $m(\th_t;\phi,t)$ parameterized by $\phi$ that gives an accurate prediction of the gradient $\nabla r_t(\th)$, given the model parameters $\th_t$ at time $t$. Intuitively, if we use the generated gradient $m(\th;\phi,t)$ to update the model parameter $\th$, at the convergence, we have $\|m(\th_t;\phi,t)\|\approx 0$. Thus, once $m(\th_t;\phi,t) \approx \nabla r_t (\th_t)$, the norm $\|\nabla r_t (\th_t)\|$ should almost vanish.
\fi


\paragraph{Architecture of MFGG.}
MFGG models $\nabla r_{t}(\th)$ with a nonlinear functional autoregressive time series model \citep{bosq2000linear}. It approximates $\nabla r_{t}(\th)$ by a linear combination of the gradients
of the latest $b$ losses, $\sum_{i=0}^{b-1}a_{i}(\D_{t-b},\ldots,\D_{t-1})\nabla r_{t-1-i}(\th)$,
where the coefficients $a_{i}(\D_{t-b},\ldots,\D_{t-1})$
are produced by a neural network with the following computation graph.
\iffalse
At time $t$, the goal of the meta network is to predict $\nabla r_t(\th)$ by aggregating the information from the observed past domains $\cup_{s\in[t-1]}\D_{s}$. Similar to the common practice of the sequential data modeling {\color{red} add some reference}, we use the information from the latest $b$ domains (i.e., $\cup_{s\in\{t-1,...,t-b\}}\D_{s}$) to forecast $\nabla r_t(\th)$. For each domain $\D_s$, $s\in \{t-1, ..., t-b\}$, the meta network takes $n_s$ feature examples $x_{i,s}$ as input to represent $\D_s$ and computes in the following way \qq{introduce the general idea that $m = \sum_i a_i (\mathcal D) \nabla r$ first where $a_i$ is a general neural network, and then open an independent section to introduce the specific choice of architecture of $a_i$}:
\qq{the psuedo code of the architecture is not very reable as a part of the paper}
\fi
\begin{align*}
e_{i,j} & =\text{Embd}(x_{j}^{(i)})\in\mathbb{R}^{d_{1}}\\
e_{j} & =\sum_{i\in[n_{j}]}e_{i,j}\in\mathbb{R}^{d_{1}}\\
z & =\text{Self Attention}(e_{t-b},...,e_{t-1})\in\mathbb{R}^{d_{2}\times b}\\
a & =\text{Softmax}\circ\text{MLP}(z_{t-b},...,z_{t-1})\in\mathbb{R}^{b}
\end{align*}
Here Embd denotes the embedding layer that maps categorical features into a continuous embedding space (continuous features are left unchanged by this layer); Self Attention denotes the self-attention layer \citep{vaswani2017attention}; and MLP denotes a multi-layer perceptron. MFGG first extracts the domain features $e_j$, $j\in\{t-b, \ldots, t-1\}$, of the last $b$ domains. The self-attention layer then encodes the interactions between these domain features, and its outputs are fed into the subsequent layers to compute the coefficients $a$. The softmax layer is optional; it constrains $a$ to the probability simplex $S_b$ and hence keeps the magnitude of the generated gradient within a proper range. Letting $\phi$ collect all the parameters of these layers, we denote MFGG by $m(\th;\phi,t)$. In practice, we can simply replace $\D_j$ with mini-batch samples $\hat{\D}_j$, which gives a stochastic-gradient version of the update.
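
As an illustration of why the softmax layer controls the magnitude of the generated gradient, consider the following short derivation. It is a sketch under the assumption that the softmax is applied, so that $a\in S_b$, i.e., $a_i\ge 0$ and $\sum_{i=0}^{b-1}a_i=1$; the shorthand $g_i=\nabla r_{t-1-i}(\th)$ is introduced here for illustration only. The triangle inequality gives
\[
\|m(\th;\phi,t)\|=\Big\|\sum_{i=0}^{b-1}a_{i}g_{i}\Big\|\le\sum_{i=0}^{b-1}a_{i}\|g_{i}\|\le\max_{0\le i\le b-1}\|g_{i}\|.
\]
For instance, with $b=2$ and $a=(0.7,0.3)$, the generated gradient is $0.7\,\nabla r_{t-1}(\th)+0.3\,\nabla r_{t-2}(\th)$, whose norm cannot exceed the larger of the two past gradient norms. Without the softmax, the coefficients are unconstrained and no such bound holds in general.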


\paragraph{Optimization of MFGG.}
We use the squared $\ell_2$ loss $\|m(\th;\phi,t)-\nabla r_{t}(\th)\|^{2}$ to measure the prediction error of $m(\th;\phi,t)$ at time $t$. This error depends on both $\phi$, the parameters of MFGG, and $\th$, the parameters of the recommendation model at which the gradient is evaluated. We are mainly interested in making MFGG accurate on a small subset of the model parameter space $\Theta$ in which $\th$ gives a recommendation model with good performance. We thus apply the $\ell_2$ loss only on the (sub-sampled) optimization trajectory of $\th$, which we denote by $B$. That is, we learn $m(\th;\phi,t)$ by applying gradient descent on
\[
\sum_{\th\in B}\|m(\th;\phi,t)-\nabla r_{t}(\th)\|^{2}.
\]
Note that when computing the gradient with respect to $\phi$, $\th$ is treated as a constant, and hence we do not differentiate through $\th$. Algorithm~\ref{alg:main} summarizes the detailed procedure. Again, mini-batch versions of $m(\th;\phi,t)$ and $\nabla r_t(\th)$ can be used during the training of MFGG. In practice, when $t\le b$ we do not have enough historical data to compute MFGG, and we can simply use IU for training (alternatively, data for offline training can be used instead). Since our approach uses MFGG to predict the gradient of the loss on the unobserved future data, we name it Future Gradient Descent (FGD).
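
To make the role of the fixed $\th$ concrete, the gradient of the loss with respect to $\phi$ can be written out for a single $\th\in B$. The following is a sketch that assumes the linear-combination form of MFGG described in the architecture paragraph; the shorthand $g_i=\nabla r_{t-1-i}(\th)$ and the notation $a_i(\phi)$ for the $i$-th coefficient are introduced here for illustration only. Since $\th$, and hence each $g_i$ and $\nabla r_t(\th)$, is treated as a constant, the chain rule gives
\[
\nabla_{\phi}\|m(\th;\phi,t)-\nabla r_{t}(\th)\|^{2}=2\sum_{i=0}^{b-1}\big\langle g_{i},\,m(\th;\phi,t)-\nabla r_{t}(\th)\big\rangle\,\nabla_{\phi}a_{i}(\phi).
\]
Under these assumptions, updating $\phi$ requires only the $b$ inner products between the past gradients and the residual, together with back-propagation through the coefficient network; no second-order derivatives of the losses $r_s$ are needed.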

\paragraph{Extension to a smoothed loss.}
In practice, one might be interested in a smoothed version of the performance metric, which has been observed to be a potentially more robust evaluation metric \citep{he2014practical}. More precisely, consider the loss function
\begin{equation} \label{eq:dynamic_regret_smooth}
   \frac{1}{T}\sum_{t=1}^{T}\left[\frac{1}{w}\sum_{i=0}^{w-1}r_{t-i}(\th_{t})\right],
\end{equation}
where $r_s$ is identically zero for $s\le0$. The smoothed loss (\ref{eq:dynamic_regret_smooth}) evaluates $\th_t$ using a sliding window of width $w$ over the datasets $\cup_{i=0}^{w-1} \D_{t-i}$. We are mainly interested in the standard metric (\ref{eq:dynamic_regret}), but when (\ref{eq:dynamic_regret_smooth}) is considered, we can simply generalize FGD by replacing $m(\th;\phi,t)$ by
\[
\bar{m}(\th;\phi,t)=\frac{1}{w}\left(m(\th;\phi,t)+\sum_{i=1}^{w-1}\nabla r_{t-i}(\th)\right),
\]
when training $\th$. Here $\nabla r_s$ is defined to be $0$ for $s\le0$. We refer readers to Algorithm~\ref{alg:main_generalized} in Appendix~\ref{apx:generalize_fgd} for details. In the rest of the paper, we focus on the smoothed version of the loss as it is more general.
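
The form of $\bar{m}$ can be motivated by differentiating one window of the smoothed loss; the following is a brief sketch of that step. The gradient of the inner average in (\ref{eq:dynamic_regret_smooth}), evaluated at $\th$, is
\[
\nabla\left[\frac{1}{w}\sum_{i=0}^{w-1}r_{t-i}(\th)\right]=\frac{1}{w}\left(\nabla r_{t}(\th)+\sum_{i=1}^{w-1}\nabla r_{t-i}(\th)\right),
\]
in which only the $i=0$ term depends on the unobserved data $\D_t$. Replacing that term by its forecast $m(\th;\phi,t)$ and keeping the observed terms exact yields $\bar{m}(\th;\phi,t)$. For example, $w=1$ recovers $\bar{m}(\th;\phi,t)=m(\th;\phi,t)$, and $w=2$ gives $\bar{m}(\th;\phi,t)=\frac{1}{2}\left(m(\th;\phi,t)+\nabla r_{t-1}(\th)\right)$.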

Before moving forward, we emphasize the difference between the two window sizes $b$ and $w$, which appear in BU/FGD and in the definition of \eqref{eq:dynamic_regret_smooth}, respectively. Roughly speaking, $b$ is the number of recently observed datasets used for training the model, while $w$ is the number of datasets used for testing the model.