\section{Background}\label{sec:background}

In this section we fix the notation of the paper, recall
gradient boosted ensembles and the XGBoost algorithm, and
outline adversarial robustness and robust training for
tree-based models. 

{\bf Gradient Boosted Ensembles.}  Let
$\mathcal{D}=\left\{\left(\mathbf{x}_i, y_i\right)\right\}_{i=1}^{n}$,
with $\mathbf{x}_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$, be a
dataset with $m$ features and $n$ datapoints.  Gradient
boosting~\citep{gradientboost} is the process of sequentially adding
weak learners to an ensemble to minimise a given loss function on
$\mathcal D$. The
prediction $\hat{y}_i$ of an ensemble $F$ with $K$ weak learners is
\begin{equation}
    \hat{y}_i=F(\mathbf{x}_i)=\sum_{k=1}^K f_k\left(\mathbf{x}_i\right),
    % , \quad f_k \in \mathcal{F}
    \label{eq:boost_ensemble}
\end{equation}
where each $f_k$ is an individual weak learner. Writing
$\hat{y}_i^{(t-1)}=\sum_{k=1}^{t-1} f_k\left(\mathbf{x}_i\right)$ for
the prediction of the ensemble after $t-1$ iterations, the total loss
of the ensemble at iteration $t$ is
\begin{equation}
    \mathcal{L}^{(t)}=\sum_{i=1}^n l\left(y_i,
    \hat{y}_i^{(t-1)}+f_t\left(\mathbf{x}_i\right)\right),
    \label{eq:boost_ensemble_loss}
\end{equation}
where $l$ is an arbitrary differentiable loss function. In general,
while any model can be used as a weak learner, decision trees are
often chosen because of their expressivity and ease of training.  In
this work, we consider greedily trained binary trees with
coordinate-aligned splits. Each tree has the form
$f_t(\mathbf{x}_i) = w_{q(\mathbf{x}_i)}$, where $q$ is the tree
traversal function that maps an input $\mathbf{x}_i$ to a leaf node
with value $w_{q(\mathbf{x}_i)}$. The tree traversal function $q$ is
learned using a greedy algorithm that recursively splits the input
space to minimise the loss function.
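As a brief illustration, assuming the squared-error loss
$l(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2$ (chosen here only as an
example), the loss at iteration $t$ becomes
\begin{equation*}
    \mathcal{L}^{(t)} = \frac{1}{2}\sum_{i=1}^n
    \left(r_i - f_t\left(\mathbf{x}_i\right)\right)^2,
    \qquad r_i = y_i - \hat{y}_i^{(t-1)},
\end{equation*}
so that the new weak learner $f_t$ is fitted to the residuals $r_i$
of the current ensemble.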

{\bf The XGBoost algorithm.}
The XGBoost algorithm~\citep{chen2016xgboost} is a popular
implementation of gradient boosting that is widely used because of
its efficiency and scalability. The algorithm constructs new weak
learners by optimising a second-order Taylor approximation of the
loss function. It also introduces a regularisation term that
penalises complex tree structures and large leaf node values,
thereby mitigating overfitting. Concretely, the loss function from
Equation~\ref{eq:boost_ensemble_loss} is approximated as
\begin{equation}
    \begin{split}
        &\mathcal{L}^{(t)} \simeq \sum_{i=1}^n\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) +g_i f_t\left(\mathbf{x}_i\right)\right. \\
        &\qquad\qquad\left. +\frac{1}{2} h_i
        f_t^2\left(\mathbf{x}_i\right)\right]+\Omega\left(f_t\right),
    \end{split}
\end{equation}
where $g_i=\partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$
and $h_i=\partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$
are the first- and second-order gradient statistics of the loss
function, and $\Omega(f)=\gamma T+\frac{1}{2} \lambda\|w\|^2$ is the
regularisation term, whose hyperparameters $\gamma$ and $\lambda$
control the penalisation of the number of leaf nodes $T$ and of the
leaf node values $w$, respectively.
Using this approximate loss formulation,
the optimal value of a leaf $z$ can be computed as:
\begin{equation}
    w_z^*=-\frac{\sum_{i \in \mathcal{I}_z} g_i}{\sum_{i \in \mathcal{I}_z}
    h_i+\lambda},
\end{equation}
where $\mathcal{I}_z$ is the set of indices of datapoints that reach
leaf $z$. The algorithm recursively identifies the best
series of splits to minimise the loss function. In the exact
greedy algorithm, each feature value is considered as a
potential split. The split with the greatest loss reduction,
or split score $\mathcal{S}$, is selected. 
Each candidate split is determined by a feature index and a threshold
value. When splitting a parent node containing the set of training
datapoints $\mathcal{I}$ into left and right child nodes, containing
the sets of points $\mathcal{I}_L$ and $\mathcal{I}_R$ respectively,
we express the split score as a function of $\mathcal{I}_L$ and
$\mathcal{I}_R$ as:
\begin{equation}
    \label{eq:xgb-split-score}
    \begin{split}
        \mathcal{S}(\mathcal{I}_L, \mathcal{I}_R) &= \frac{1}{2} \left[ \frac{\left(\sum_{i \in \mathcal{I}_L} g_i\right)^2}{\sum_{i \in \mathcal{I}_L} h_i + \lambda} + \frac{\left(\sum_{i \in \mathcal{I}_R} g_i\right)^2}{\sum_{i \in \mathcal{I}_R} h_i + \lambda} \right. \\
        &\left. - \frac{\left(\sum_{i \in \mathcal{I}}
        g_i\right)^2}{\sum_{i \in \mathcal{I}} h_i + \lambda} \right]
        - \gamma.
    \end{split}
\end{equation}
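Both the optimal leaf value and the split score follow from the
approximate loss above by a short calculation, sketched here with the
shorthand $G_z$ and $H_z$ (introduced only for this illustration) for
the sums of $g_i$ and $h_i$ over the datapoints that reach leaf $z$.
Dropping the term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$, which does
not depend on $f_t$, and grouping the datapoints by leaf gives
\begin{equation*}
    \sum_{z=1}^{T}\left[G_z w_z
    + \frac{1}{2}\left(H_z+\lambda\right) w_z^2\right] + \gamma T.
\end{equation*}
Assuming $H_z+\lambda>0$, each summand is a quadratic in $w_z$ that is
minimised at $w_z^* = -G_z/(H_z+\lambda)$, where it takes the value
$-\frac{1}{2} G_z^2/(H_z+\lambda)$. Replacing one leaf by two children
therefore reduces the objective by one half of the bracketed
expression in Equation~\ref{eq:xgb-split-score}, minus the cost
$\gamma$ of the additional leaf.
As a small worked instance, assume the squared-error loss
$l(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2$, for which
$g_i = \hat{y}_i^{(t-1)} - y_i$ and $h_i = 1$, and consider a
hypothetical node with four datapoints with gradients
$(1, 1, -2, -2)$, $\lambda = 1$ and $\gamma = 1$. Separating the first
two points from the last two gives $G_L = 2$, $H_L = 2$, $G_R = -4$
and $H_R = 2$, so that
\begin{equation*}
    \mathcal{S} = \frac{1}{2}\left[\frac{4}{3} + \frac{16}{3}
    - \frac{4}{5}\right] - 1 = \frac{29}{15},
\end{equation*}
with leaf values $w_L^* = -2/3$ and $w_R^* = 4/3$.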

{\bf Robust Learning for Boosted Ensembles.}
\label{robust-learning-section}
Adversarial examples are inputs obtained by applying small, often
imperceptible, perturbations to a correctly predicted input so that
the model in question predicts incorrectly.
Consider an input region $\Delta_{\epsilon}(\mathbf{x})$
centered around a datapoint $\mathbf{x}$ with a radius of
$\epsilon$ under the $L_{\infty}$ norm:
\begin{equation}
    \Delta_{\epsilon}(\mathbf{x}) = \left\{\mathbf{x}' \in \mathbb{R}^m \mid \|\mathbf{x}' - \mathbf{x}\|_{\infty} \leq \epsilon\right\}.
\end{equation}
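Since the $L_{\infty}$ norm bounds each coordinate separately, this
region is equivalently the axis-aligned box
\begin{equation*}
    \Delta_{\epsilon}(\mathbf{x}) = \prod_{j=1}^{m}
    \left[x_j - \epsilon,\; x_j + \epsilon\right],
\end{equation*}
where $x_j$ denotes the $j$-th coordinate of $\mathbf{x}$ (notation
used only for this illustration). For example, for $m = 2$,
$\mathbf{x} = (0.5, 0.2)$ and $\epsilon = 0.1$, the region is
$[0.4, 0.6] \times [0.1, 0.3]$.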

Given an input $\mathbf{x}$ with label $y$, an adversarial example
$\mathbf{x}'$ can be computed as:
\begin{equation}
    \label{eq:adv-example}
    \mathbf{x}' = \operatorname*{argmax}_{\mathbf{x}' \in
        \Delta_{\epsilon}(\mathbf{x})}
        l\left(y, F\left(\mathbf{x}'\right)\right).
\end{equation}
Thus, an ensemble can be trained to be robust against
adversarial examples by minimising the loss under the
worst-case adversarial perturbation for each training
example, as formulated by \cite{madry2018minmax}:
\begin{equation}
\label{eq:rob-training}
\min _F \sum_{i=1}^n \max _{\mathbf{x}' \in \Delta_{\epsilon}(\mathbf{x}_i)} l\left(y_i, F\left(\mathbf{x}'\right)\right).
\end{equation}
Accordingly, constructing the weak learner at iteration $t$ of
gradient boosting requires minimising the following robust loss:
\begin{equation}
    \begin{split}
        \mathcal{L}^{(t)}_{rob} = \sum_{i=1}^n
        \max_{\mathbf{x}' \in
        \Delta_{\epsilon}(\mathbf{x}_i)} l\left(y_i,
        \sum_{k=1}^{t-1} f_k\left(\mathbf{x}'\right)
        +f_t\left(\mathbf{x}'\right)\right).
    \end{split}
\end{equation}

