\section{Diffusion Models}
In this section we briefly review the basics of diffusion models.
We focus on the stochastic differential equation formulation
first presented by \citet{song2021scorebased}.

Let $p(\bs{y})_\text{data}$ denote the data distribution.
The goal of a diffusion model is to learn a mapping from a
simple distribution $p(\bs{z})$ to the data distribution $p(\bs{y})_\text{data}$.

This is achieved by reversing a diffusion process.
In particular, we construct a stochastic process $\bs{y}(t)$, $t \in [0, T]$,
such that $\bs{y}(0) \sim p(\bs{y})_\text{data}$ and $\bs{y}(T) \sim p(\bs{y}(T))$,
where $p(\bs{y}(T))$ is a simple distribution we can sample from, and whose evolution is given by the SDE
\begin{align}
  d \bs{y}(t) = \bs{f}(\bs{y}(t), t) dt + g(t) d\bs{w}(t),
\end{align}
where $\bs{w}(t)$ is a standard Brownian motion, and $\bs{f} : \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$
and $g : [0, T] \to \mathbb{R}$ are called the drift coefficient and the diffusion coefficient, respectively.
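
As a concrete illustration, one common choice is the variance-preserving SDE of \citet{song2021scorebased}. It is shown here only as an example and is not necessarily the choice made in this paper; the noise schedule $\beta : [0, T] \to (0, \infty)$ is notation introduced for this example. Setting $\bs{f}(\bs{y}, t) = -\tfrac{1}{2} \beta(t) \bs{y}$ and $g(t) = \sqrt{\beta(t)}$ gives a Gaussian transition kernel in closed form,
\begin{align*}
  p(\bs{y}(t) | \bs{y}(0)) = \mathcal{N}\left( \bs{y}(t) ;\; \bs{y}(0)\, e^{-\frac{1}{2} \int_0^t \beta(s) ds},\; \left( 1 - e^{-\int_0^t \beta(s) ds} \right) \bs{I} \right).
\end{align*}
When $\int_0^T \beta(s) ds$ is large, the mean is close to $\bs{0}$ and the covariance is close to $\bs{I}$, so $p(\bs{y}(T))$ is approximately a standard Gaussian regardless of $\bs{y}(0)$.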

It is possible to reverse this SDE and sample from $p(\bs{y})_\text{data}$ by first sampling
$\bs{y}(T) \sim p(\bs{y}(T))$ and then evolving the system backwards in time. This is done by solving the
reverse-time SDE (Anderson, 1982)
\begin{align}
    d \bs{y}(t) = [\bs{f}(\bs{y}(t), t) - g(t)^2 \nabla_{\bs{y}(t)} \log p(\bs{y}(t))] dt + g(t) d\bar{\bs{w}}(t),
\end{align}
where $\bar{\bs{w}}(t)$ is a standard Brownian motion when time flows backwards from $T$ to $0$. Thus, because $\bs{f}$ and $g$ are known,
and we construct the SDE so that $p(\bs{y}(T))$ is simple, as long
as we know the score $\nabla_{\bs{y}(t)} \log p(\bs{y}(t))$ we can sample from $p(\bs{y})_\text{data}$.
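
As a minimal sketch of how the reverse SDE can be simulated, one may discretize it with an Euler--Maruyama scheme. The step size $\Delta t > 0$, the noise variable $\bs{z}$, and the use of an approximation $\bs{s}(\bs{y}(t), t)$ in place of the true score are assumptions made for this illustration, and other numerical solvers are possible. Starting from $\bs{y}(T) \sim p(\bs{y}(T))$, each step from $t$ to $t - \Delta t$ reads
\begin{align*}
    \bs{y}(t - \Delta t) = \bs{y}(t) - \left[ \bs{f}(\bs{y}(t), t) - g(t)^2 \bs{s}(\bs{y}(t), t) \right] \Delta t + g(t) \sqrt{\Delta t}\, \bs{z}, \quad \bs{z} \sim \mathcal{N}(\bs{0}, \bs{I}),
\end{align*}
with a fresh $\bs{z}$ drawn at every step, and the iteration is repeated until $t = 0$.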

\subsection*{Estimating the Score}
An important result by \citet{Vincent2010} is that it is possible to estimate the score
$\nabla_{\bs{y}(t)} \log p(\bs{y}(t))$ by computing
\begin{align}
    \bs{s}^* = \text{argmin}_{\bs{s} \in \mathcal{S}}  \EE_{t}\EE_{p(\bs{y}(0))_\text{data}}\EE_{p(\bs{y}(t)|\bs{y}(0))} \left[ \left\| \nabla_{\bs{y}(t)} \log p(\bs{y}(t)| \bs{y}(0)) - \bs{s}(\bs{y}(t), t) \right\|^2 \right],
\end{align}
where $\mathcal{S} = \{ \bs{s}: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d \}$ is the set of candidate time-dependent score functions,
and $\EE_{t}$ denotes the expectation over $t$ sampled uniformly from $[0, T]$.
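
To see why this objective recovers the score, recall that the minimizer of a squared-error regression is the conditional mean of the target. The following is a short sketch of the argument, assuming enough regularity to exchange differentiation and integration:
\begin{align*}
    \bs{s}^*(\bs{y}(t), t)
    &= \EE_{p(\bs{y}(0) | \bs{y}(t))} \left[ \nabla_{\bs{y}(t)} \log p(\bs{y}(t) | \bs{y}(0)) \right] \\
    &= \int \frac{\nabla_{\bs{y}(t)} p(\bs{y}(t) | \bs{y}(0))}{p(\bs{y}(t) | \bs{y}(0))} \, \frac{p(\bs{y}(t) | \bs{y}(0)) \, p(\bs{y}(0))_\text{data}}{p(\bs{y}(t))} \, d\bs{y}(0) \\
    &= \frac{\nabla_{\bs{y}(t)} p(\bs{y}(t))}{p(\bs{y}(t))} = \nabla_{\bs{y}(t)} \log p(\bs{y}(t)).
\end{align*}
The regression target is also easy to evaluate when the transition kernel is Gaussian. This is an assumption that holds for affine drift coefficients but is not required by the formulation above, and the symbols $\bs{\mu}_t$, $\sigma_t$, and $\bs{z}$ are introduced only for this example. If $p(\bs{y}(t) | \bs{y}(0)) = \mathcal{N}(\bs{\mu}_t(\bs{y}(0)), \sigma_t^2 \bs{I})$ and we write $\bs{y}(t) = \bs{\mu}_t(\bs{y}(0)) + \sigma_t \bs{z}$ with $\bs{z} \sim \mathcal{N}(\bs{0}, \bs{I})$, then
\begin{align*}
    \nabla_{\bs{y}(t)} \log p(\bs{y}(t) | \bs{y}(0)) = - \frac{\bs{y}(t) - \bs{\mu}_t(\bs{y}(0))}{\sigma_t^2} = - \frac{\bs{z}}{\sigma_t},
\end{align*}
so the target is simply a rescaled copy of the noise that was added to $\bs{y}(0)$.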

\subsection*{Conditional Diffusion Models}
Although it is most common to train diffusion models unconditionally as explained above, one can also train
diffusion models conditionally on some input $\bs{x}$.

To do so we make the following modifications to the above formulation.
\begin{enumerate}
    \item We construct one separate SDE per value of $\bs{x}$. Each SDE shares the same drift and diffusion coefficients
    but the initial distribution $p_{\bs{x}}(\bs{y}(0))$ is given by $p(\bs{y}| \bs{x})_{\text{data}}$.
    \item The reverse SDE is now given by
    \begin{align}
        d \bs{y}(t) = [\bs{f}(\bs{y}(t), t) - g(t)^2 \nabla_{\bs{y}(t)} \log p_{\bs{x}}(\bs{y}(t))] dt + g(t) d\bar{\bs{w}}(t),
    \end{align}
    where $\nabla_{\bs{y}(t)} \log p_{\bs{x}}(\bs{y}(t))$ is the score of the conditional distribution $p_{\bs{x}}(\bs{y}(t))$.
    Importantly, because we choose the diffusion and drift coefficients so that at $t = T$ the distribution is the same
    for all values of $\bs{x}$, we can still sample from the data distribution in the same way as before.
    \item The final change is that the score function is now estimated by
    \begin{align*}
    \bs{s}^* = \text{argmin}_{\bs{s} \in \mathcal{S}}  \EE_{t}\EE_{p(\bs{y}(0), \bs{x})_\text{data}}\EE_{p(\bs{y}(t)|\bs{y}(0))} \left[ \left\| \nabla_{\bs{y}(t)} \log p(\bs{y}(t)| \bs{y}(0)) - \bs{s}(\bs{y}(t),\bs{x}, t) \right\|^2 \right],
    \end{align*}
    where now $\mathcal{S} = \{ \bs{s}: \mathbb{R}^d \times \mathbb{R}^m \times [0, T] \to \mathbb{R}^d \}$ is the set of candidate
    score functions that may also depend on the input $\bs{x}$, and the outer expectation is taken over the joint distribution
    $p(\bs{y}(0), \bs{x})_\text{data}$.
    We emphasize that the score of the conditional distribution $p(\bs{y}(t)| \bs{y}(0))$ is still
    the same because once we condition on $\bs{y}(0)$ the distribution is the same for all values of $\bs{x}$.

\end{enumerate}
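
The reason the modified objective yields the conditional score can be sketched as follows, under the assumption stated above that the transition kernel does not depend on $\bs{x}$. The noised conditional distribution is the mixture
\begin{align*}
    p_{\bs{x}}(\bs{y}(t)) = \int p(\bs{y}(t) | \bs{y}(0)) \, p(\bs{y}(0) | \bs{x})_\text{data} \, d\bs{y}(0),
\end{align*}
and the minimizer of the squared error, for fixed $\bs{y}(t)$, $\bs{x}$ and $t$, is the conditional mean of the regression target, which equals the score of this mixture:
\begin{align*}
    \bs{s}^*(\bs{y}(t), \bs{x}, t) = \EE_{p(\bs{y}(0) | \bs{y}(t), \bs{x})} \left[ \nabla_{\bs{y}(t)} \log p(\bs{y}(t) | \bs{y}(0)) \right] = \nabla_{\bs{y}(t)} \log p_{\bs{x}}(\bs{y}(t)).
\end{align*}
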
This formulation of conditional diffusion models differs from controllable generation as
presented by \citet{song2021scorebased}. There, a conditional diffusion model is constructed by noting
that, by Bayes' rule,
\[ \nabla_{\bs{y}(t)} \log p(\bs{y}(t)| \bs{x}) = \nabla_{\bs{y}(t)} \log p(\bs{y}(t)) + \nabla_{\bs{y}(t)} \log p(\bs{x}| \bs{y}(t)) \]
and hence if we obtain the first term from an unconditional diffusion model, and the second term by differentiating
through another trained model $p(\bs{x}| \bs{y}(t))$, we can obtain the score of the conditional distribution.
In our case this is not feasible because in general the dimension of $\bs{y}$ will be much smaller than the dimension of $\bs{x}$.


\section{Gradient Boosted Trees}
Gradient Boosted Trees (GBT) \cite{friedman2001greedy} are a popular non-parametric machine learning model for
function approximation.
The objective is to find a function $F: \mathbb{R}^d \to \mathbb{R}$ that minimizes
\begin{align}
    L(F) = \EE_{\bs{x}, y} \left[ l(y, F(\bs{x})) \right],
\end{align}
where $l: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function, and the expectation is taken over the joint distribution
of the input $\bs{x}$ and the target $y$.
GBT does this by restricting $F$ to be a scaled sum of $M$ decision
trees $f_m: \mathbb{R}^d \to \mathbb{R}$, i.e.
\begin{align}
    F(\bs{x}) = \sum_{m = 1}^{M} \epsilon f_m(\bs{x}), \quad \epsilon \in (0, 1),
\end{align}
where $\epsilon$ is a learning rate or shrinkage parameter.
In the most basic form of the algorithm each tree is constructed to approximate gradient descent on the loss function $L(F)$.
In particular, if we let $F_{i-1} = \sum_{m = 1}^{i-1} \epsilon f_m$ denote the function after $i-1$ iterations,
then the $i$-th tree is constructed to approximately fit the negative gradient of the loss in the squared-error sense,
\begin{align}
    f_i = \text{argmin}_{f} \EE_{\bs{x}, y} \left( f(\bs{x}) + \left. \frac{\partial l(y, \hat{y})}{\partial \hat{y}}\right|_{\hat{y} = F_{i-1}(\bs{x})} \right)^2,
\end{align}
using empirical risk minimization and a greedy algorithm to construct the tree.
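
To make the gradient step concrete, consider the squared-error loss as an illustrative special case. The loss is chosen only for this example, $F$ denotes the current ensemble, and $\{(\bs{x}_n, y_n)\}_{n=1}^{N}$ is a hypothetical training set. In this case
\begin{align*}
    l(y, \hat{y}) = \tfrac{1}{2} (y - \hat{y})^2, \qquad - \left. \frac{\partial l(y, \hat{y})}{\partial \hat{y}} \right|_{\hat{y} = F(\bs{x})} = y - F(\bs{x}),
\end{align*}
so the negative gradient is the residual of the current ensemble, and the next tree $f$ is fit by least squares to these residuals,
\begin{align*}
    f = \text{argmin}_{f'} \frac{1}{N} \sum_{n = 1}^{N} \left( f'(\bs{x}_n) - \left( y_n - F(\bs{x}_n) \right) \right)^2,
\end{align*}
after which the ensemble is updated as $F \leftarrow F + \epsilon f$.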

Various modifications to the basic algorithm have been proposed and implemented, such as
regularization, specialized tree-construction strategies, support for categorical features, and
higher-order optimization \cite{chen2016xgboost, ke2018lightgbm,prokhorenkova2019catboost}.
In this paper we focus on the implementation of GBTs in the LightGBM library \cite{ke2018lightgbm}
with the understanding that the same principles apply to any other GBT implementation.
