\section{Robust XGBoost Trees}\label{sec:rob-loss}

In this section we introduce a robust splitting criterion and
incorporate it into the XGBoost algorithm. Unlike conventional
approaches, which minimise the loss function on the training
data, our method targets the robustness of the resulting trees
to adversarial perturbations. This is achieved by
integrating an analytical upper bound of the adversarial loss within
the recursive splitting procedure of the individual decision trees.

\subsection{Robust Splitting in XGBoost Trees}
\label{sec:robust-splitting-criterion}

At the core of our robust training procedure is a splitting
criterion that modifies Equation~\ref{eq:xgb-split-score} to
incorporate the worst-case adversarial loss within the
recursive splitting procedure of individual decision trees.
Instead of considering only the static sets of points
$\mathcal{I}_L$ and $\mathcal{I}_R$ for the left and right
child nodes, the new formulation additionally includes the
ambiguity set $\Delta \mathcal{I}$, which contains all data
points that could change child nodes under an adversarial
perturbation. Here we consider perturbations within an
$L_{\infty}$ ball centred on a training data
point~\eqref{eq:adv-example}.
With axis-aligned splits, the worst-case robust loss with
respect to these perturbations can be computed by treating
each feature independently.
In particular, when splitting on feature $j$, perturbations
on other features have no impact on which child node a
data point ends up in; therefore only perturbations of $\pm
\epsilon$ along feature $j$ need to be considered.


To define and analyse the robust splitting criterion, we
borrow the following notation from~\cite{chen2019training}
for the sets of data points induced by a split with
threshold $\eta$ on feature $j$.

\begin{table}[h]
\centering
\caption{Definitions of all sets used in the robust splitting
criterion, when considering a split on feature $j$ with threshold
$\eta$, under an $L_{\infty}$ perturbation of radius $\epsilon$.}
{\fontsize{9}{11}\selectfont
\begin{tabular}{cc}
\toprule
\textbf{Notation} & \textbf{Definition} \\
\midrule
$\mathcal{I}$  & Set of examples in the node being split \\
$\mathcal{I}_L$ & $\mathcal{I} \cap \{(\mathbf{x_i}, y_i)|x_i^{(j)} \leq \eta\}$ \\
$\mathcal{I}_R$ & $\mathcal{I} \cap \{(\mathbf{x_i}, y_i)|x_i^{(j)} > \eta\}$ \\
$\Delta \mathcal{I}$ & $\mathcal{I} \cap \{(\mathbf{x_i}, y_i)|\eta - \epsilon < x_i^{(j)} \leq \eta + \epsilon\}$  \\
$\mathcal{I}_L^0$ & $\mathcal{I}_L \setminus \Delta \mathcal{I}$ \\
$\mathcal{I}_R^0$ & $\mathcal{I}_R \setminus \Delta \mathcal{I}$ \\

\bottomrule
\end{tabular}
}

\label{tab:notation}
\end{table}

Intuitively, the ambiguity set $\Delta \mathcal{I}$ contains 
all points that could switch child nodes under an adversarial
perturbation. In the context of an $L_{\infty}$ attack 
model, this essentially corresponds to all points that are
within a distance of $\epsilon$ from the threshold $\eta$.
The sets $\mathcal{I}_L^0$ and $\mathcal{I}_R^0$ 
contain all points that are further than $\epsilon$ from the
threshold $\eta$ and are thus guaranteed to remain in the
left and right child nodes respectively under any
perturbation.
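
As an illustrative example (with hypothetical values chosen only to
make the definitions in Table~\ref{tab:notation} concrete), consider a
node containing five points whose values on feature $j$ are
$x_1^{(j)} = 0.10$, $x_2^{(j)} = 0.30$, $x_3^{(j)} = 0.50$,
$x_4^{(j)} = 0.62$ and $x_5^{(j)} = 0.90$, with threshold $\eta = 0.5$
and radius $\epsilon = 0.15$. Identifying each point with its index,
the sets are
\begin{equation*}
\begin{gathered}
    \mathcal{I}_L = \{1, 2, 3\}, \qquad
    \mathcal{I}_R = \{4, 5\}, \qquad
    \Delta \mathcal{I} = \{3, 4\}, \\
    \mathcal{I}_L^0 = \{1, 2\}, \qquad
    \mathcal{I}_R^0 = \{5\},
\end{gathered}
\end{equation*}
since only $x_3^{(j)}$ and $x_4^{(j)}$ lie in the interval
$(\eta - \epsilon, \eta + \epsilon] = (0.35, 0.65]$.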

The robust splitting criterion can then be defined as:
\begin{equation}
    \label{eq:robust-splitting-criterion-exact}
    \begin{split}
        \mathcal{S}_{\text{rob}}(\mathcal{I}_L^0, \mathcal{I}_R^0, \Delta \mathcal{I}) =& \min_{r_i \in \{0, 1\}} \frac{1}{2}\Biggl[  \frac{\left(\sum_{i \in \mathcal{I}_L^0} g_i + \sum_{i \in \Delta\mathcal{I}} r_i g_i \right)^2}{\sum_{i \in \mathcal{I}_L^0} h_i + \sum_{i \in \Delta\mathcal{I}} r_i h_i + \lambda} \\[1ex]
        + & \frac{\left(\sum_{i \in \mathcal{I}_R^0} g_i + \sum_{i \in \Delta\mathcal{I}} (1 - r_i) g_i \right)^2}{\sum_{i \in \mathcal{I}_R^0} h_i + \sum_{i \in \Delta\mathcal{I}} (1 - r_i) h_i + \lambda} \\[1ex]
        - & \frac{\left(\sum_{i \in \mathcal{I}} g_i\right)^2}{\sum_{i
    \in \mathcal{I}} h_i + \lambda} \Biggr] - \gamma,
    \end{split}
\end{equation}
where $r_i$ is a binary variable that indicates whether
data point $i$ in the ambiguity set $\Delta \mathcal{I}$
moves to the left child node. Computing the optimal values of
the $r_i$ is a combinatorial optimisation problem whose
complexity is exponential in the size of the ambiguity set,
and it is thus computationally intractable to solve for each of the
$\mathcal{O}(|\mathcal{I}|m)$ candidate splits.
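
To illustrate the scale of this problem (a back-of-the-envelope count
rather than a measured result): each $r_i$ takes one of two values, so
an exhaustive search over a single candidate split must compare
\begin{equation*}
    2^{|\Delta \mathcal{I}|}
\end{equation*}
assignments. With a hypothetical ambiguity set of size
$|\Delta \mathcal{I}| = 30$, this already amounts to
$2^{30} \approx 1.07 \times 10^{9}$ evaluations of the objective.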

We can instead derive a lower bound to the problem by
considering a linear relaxation on the binary variables
$r_i$:
\begin{equation}
    \label{eq:robust-splitting-criterion}
    \begin{split}
        \mathcal{S}_{\text{rob}}(\mathcal{I}_L^0, \mathcal{I}_R^0, \Delta \mathcal{I}) &\geq \mathcal{S}_{\text{rob}}^{\text{lb}} = \min_{p,q} \frac{1}{2}\Biggl[  \frac{\left(\sum_{i \in \mathcal{I}_L^0} g_i + p \right)^2}{\sum_{i \in \mathcal{I}_L^0} h_i + q + \lambda} \\[1ex]
        + & \frac{\left(\sum_{i \in \mathcal{I}_R^0} g_i + \sum_{i \in \Delta\mathcal{I}} g_i - p \right)^2}{\sum_{i \in \mathcal{I}_R^0} h_i + \sum_{i \in \Delta\mathcal{I}} h_i - q + \lambda} \\[1ex]
        - & \frac{\left(\sum_{i \in \mathcal{I}} g_i\right)^2}{\sum_{i
    \in \mathcal{I}} h_i + \lambda} \Biggr] - \gamma,
    \end{split}
\end{equation}
where $p$ and $q$ are continuous variables that represent,
respectively, the sums of the first and second derivatives
of the points from the ambiguity set that move to the left
child node. This relaxation results in a continuous
optimisation problem with an analytical solution that can be
computed in constant time. Because the resulting value is a
lower bound on the robust split score, it yields an upper
bound on the robust loss function, and we therefore use it
to evaluate candidate splits so as to minimise this upper
bound.
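
The link between the two bounds can be sketched as follows, under the
assumption (standard for XGBoost, and our reading of
Equation~\ref{eq:xgb-split-score}) that the split score equals the
reduction of the regularised second-order objective achieved by the
split. Write $\mathcal{L}_{\text{parent}}$ for the objective of the
unsplit node and $\mathcal{S}(r)$ for the objective of
Equation~\ref{eq:robust-splitting-criterion-exact} evaluated at a
fixed assignment $r$; both symbols are introduced here for
illustration only. The worst-case objective after the split then
satisfies
\begin{equation*}
    \max_{r} \bigl[ \mathcal{L}_{\text{parent}} - \mathcal{S}(r) \bigr]
    = \mathcal{L}_{\text{parent}} - \min_{r} \mathcal{S}(r)
    = \mathcal{L}_{\text{parent}} - \mathcal{S}_{\text{rob}}
    \leq \mathcal{L}_{\text{parent}} - \mathcal{S}_{\text{rob}}^{\text{lb}},
\end{equation*}
so choosing the split with the largest
$\mathcal{S}_{\text{rob}}^{\text{lb}}$ minimises this upper bound.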

\subsection{Tightening the Linear Relaxation}

The previously described linear relaxation of the robust 
splitting criterion is generally quite loose because the 
optimal values for $p$ and $q$ may not satisfy the binary 
constraints imposed on $r_i$. This discrepancy can lead 
to substantial under-approximation errors in the robust 
split scores, effectively diminishing the distinction 
between high-quality and low-quality candidate splits. 
As a result, the decision tree may fail to identify 
effective splits, ultimately compromising its predictive 
performance.

To alleviate this shortcoming, we now tighten the relaxation
by introducing box constraints around the minimum and
maximum values of $p$ and $q$. In particular, we observe the
following:

\begin{itemize}
    \item The first derivative of the loss function can be
        positive or negative, therefore the minimum value of
        $p$ is the sum of all negative elements in the set
        $\{ g_i \mid i \in \Delta \mathcal{I} \}$, and the
        maximum value of $p$ is the sum of all positive
        elements in the set $\{ g_i \mid i \in \Delta
        \mathcal{I} \}$. Hence, $p \in \left[\sum_{i \in
    \Delta \mathcal{I} } \min(0, g_i), \sum_{i \in \Delta
\mathcal{I} } \max(0, g_i) \right]$.
    
    \item The second derivative of any convex loss function
        is non-negative, therefore $q \in \left[0,
        \sum_{i \in \Delta \mathcal{I} } h_i \right]$.
\end{itemize}
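
As a small worked example (with hypothetical derivative values),
suppose the ambiguity set contains three points with
$(g_i, h_i) \in \{(-0.4, 0.2),\ (0.3, 0.25),\ (0.1, 0.1)\}$. The box
constraints then read
\begin{equation*}
\begin{gathered}
    \textstyle\sum_{i \in \Delta \mathcal{I}} \min(0, g_i) = -0.4, \qquad
    \textstyle\sum_{i \in \Delta \mathcal{I}} \max(0, g_i) = 0.3 + 0.1 = 0.4, \qquad
    \textstyle\sum_{i \in \Delta \mathcal{I}} h_i = 0.55, \\
    p \in [-0.4,\ 0.4], \qquad q \in [0,\ 0.55].
\end{gathered}
\end{equation*}
Note that the corner $(p, q) = (0.4, 0)$ lies inside this box although
no binary assignment attains it: reaching $p = 0.4$ requires the
second and third points to move left, which forces
$q = 0.25 + 0.1 = 0.35$.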

While the box constraints greatly tighten the linear 
relaxation, they do not capture the combinatorial nature of 
the binary variables $r_i$. We can further tighten the 
relaxation by approximating the feasible region of the 
values $p$ and $q$ that are consistent with the points
in the ambiguity set moving between the left and right
child nodes. In particular, we aim to capture the 
constraint that if a point $i$ moves to the left node,
then \textit{both} the first and the second derivatives
of the point must contribute to the sums $p$ and $q$, 
and vice versa. This can be achieved by considering the
maximum and minimum values of the first and second 
derivatives of the points in the ambiguity set.


To define the constraints we introduce some preliminary
notation. Let $u$ be an auxiliary variable that denotes the
number of points in the ambiguity set that move to the left
child node. Let $g_{\text{min}}^{(\Delta \mathcal{I})}$,
$g_{\text{max}}^{(\Delta \mathcal{I})}$,
$h_{\text{min}}^{(\Delta \mathcal{I})}$ and
$h_{\text{max}}^{(\Delta \mathcal{I})}$ be the minimum and
maximum values of the first and second derivatives of the
points in the ambiguity set, respectively. Furthermore, let
$G^{(\Delta \mathcal{I})}$ and $H^{(\Delta \mathcal{I})}$ be
the sums of the first and second derivatives of the points
in the ambiguity set. These can be used to construct linear
constraints between $p$, $q$ and $u$ as follows:
\begin{equation}
\setlength{\jot}{10pt}
\begin{gathered}
    g_{\text{min}}^{(\Delta \mathcal{I})}\cdot u \leq p \leq g_{\text{max}}^{(\Delta \mathcal{I})}\cdot u, \\
    h_{\text{min}}^{(\Delta \mathcal{I})}\cdot u \leq q \leq h_{\text{max}}^{(\Delta \mathcal{I})}\cdot u, \\
    g_{\text{min}}^{(\Delta \mathcal{I})}\,(|\Delta \mathcal{I}| - u) \leq G^{(\Delta \mathcal{I})} - p \leq g_{\text{max}}^{(\Delta \mathcal{I})}\,(|\Delta \mathcal{I}| - u), \\
    h_{\text{min}}^{(\Delta \mathcal{I})}\,(|\Delta \mathcal{I}| - u) \leq H^{(\Delta \mathcal{I})} - q \leq h_{\text{max}}^{(\Delta \mathcal{I})}\,(|\Delta \mathcal{I}| - u).
\end{gathered}
\end{equation}
Eliminating the auxiliary variable $u$ from these inequalities
yields the following linear constraints between $p$ and $q$:
\begin{equation}
    \label{eq:rob_constraints}
    \setlength{\jot}{10pt}
    \begin{gathered}
        p \leq q \cdot c_1, \\
        p \geq q \cdot c_2, \\
        p \geq G^{(\Delta \mathcal{I})} - c_1(H^{(\Delta \mathcal{I})} - q), \\
        p \leq G^{(\Delta \mathcal{I})} - c_2(H^{(\Delta
        \mathcal{I})} - q),
    \end{gathered}
\end{equation}
where the parameters \(c_1\) and \(c_2\) are defined as follows:
\begin{equation}
    c_1 =
    \begin{cases}
    \tfrac{g_{\max}^{(\Delta\mathcal{I})}}{h_{\min}^{(\Delta\mathcal{I})}}, & g_{\max}^{(\Delta\mathcal{I})} \ge 0, \\[2ex]
    \tfrac{g_{\max}^{(\Delta\mathcal{I})}}{h_{\max}^{(\Delta\mathcal{I})}}, & g_{\max}^{(\Delta\mathcal{I})} < 0,
    \end{cases}
    \quad
    c_2 =
    \begin{cases}
    \tfrac{g_{\min}^{(\Delta\mathcal{I})}}{h_{\max}^{(\Delta\mathcal{I})}}, & g_{\min}^{(\Delta\mathcal{I})} \ge 0, \\[2ex]
    \tfrac{g_{\min}^{(\Delta\mathcal{I})}}{h_{\min}^{(\Delta\mathcal{I})}}, & g_{\min}^{(\Delta\mathcal{I})} < 0.
    \end{cases}
\end{equation}
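
The first constraint can be derived as follows; the remaining three
are analogous. This sketch assumes
$h_{\text{min}}^{(\Delta \mathcal{I})} > 0$, so that the ratios above
are well defined. For $g_{\text{max}}^{(\Delta \mathcal{I})} \geq 0$,
\begin{equation*}
    p \;\leq\; g_{\text{max}}^{(\Delta \mathcal{I})}\, u
    \;\leq\; g_{\text{max}}^{(\Delta \mathcal{I})}\,
    \frac{q}{h_{\text{min}}^{(\Delta \mathcal{I})}}
    \;=\; c_1\, q,
\end{equation*}
where the second inequality uses
$h_{\text{min}}^{(\Delta \mathcal{I})}\, u \leq q$. As a hypothetical
numerical example, take three ambiguous points with
$(g_i, h_i) \in \{(-0.4, 0.2),\ (0.3, 0.25),\ (0.1, 0.1)\}$, so that
$g_{\text{max}}^{(\Delta \mathcal{I})} = 0.3$,
$g_{\text{min}}^{(\Delta \mathcal{I})} = -0.4$,
$h_{\text{min}}^{(\Delta \mathcal{I})} = 0.1$,
$h_{\text{max}}^{(\Delta \mathcal{I})} = 0.25$,
$G^{(\Delta \mathcal{I})} = 0$ and $H^{(\Delta \mathcal{I})} = 0.55$.
Then $c_1 = 0.3 / 0.1 = 3$ and $c_2 = -0.4 / 0.1 = -4$, and the
constraints of Equation~\eqref{eq:rob_constraints} become
\begin{equation*}
    -4q \;\leq\; p \;\leq\; 3q, \qquad
    3q - 1.65 \;\leq\; p \;\leq\; 2.2 - 4q.
\end{equation*}
The point $(p, q) = (0.4, 0)$, which satisfies the box constraints,
violates $p \leq 3q$ and is therefore excluded, whereas the attainable
point $(p, q) = (0.4, 0.35)$, obtained by moving the second and third
points to the left child, satisfies all four inequalities.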

The optimal values of $p$ and $q$ can be computed analytically
in constant time by solving the constrained optimisation problem.
The detailed analytical solution is given in Appendix \ref{sec:appendix-analytical-solution}.


We have thus derived a tighter formulation for the robust
splitting criterion that can be used to evaluate candidate
splits in the tree building procedure. The key advantages of
this formulation are threefold: (i) it can be solved
analytically in constant time, (ii) it can be integrated
within the XGBoost algorithm to build robust gradient
boosted ensembles with limited computational overhead, and
(iii) it is agnostic to the choice of loss function, and can
thus be used with any twice-differentiable loss function for
various tasks such as regression, classification, and
ranking.

\subsection{Constructing Robust Trees}

We now  integrate  the robust splitting criterion with
the XGBoost algorithm. The underlying  tree building
procedure is modified to evaluate candidate splits using the
robust splitting criterion, and select the split that
maximises the robust score.

Similarly to previous work~\citep{fprdt2022,vos2021groot},
we consider all features $j \in [m]$
and thresholds $\eta \in W_j$ as candidate splits, where
\begin{equation}
    W_j = \bigcup_{i \in \mathcal{I}} \{x^{(j)}_i -
    \epsilon, x^{(j)}_i , x^{(j)}_i + \epsilon \}.
\end{equation}
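
For instance, with hypothetical values $x_1^{(j)} = 0.2$,
$x_2^{(j)} = 0.5$ and $\epsilon = 0.1$, the candidate set is
\begin{equation*}
    W_j = \{0.1,\ 0.2,\ 0.3\} \cup \{0.4,\ 0.5,\ 0.6\},
\end{equation*}
and in general $|W_j| \leq 3\,|\mathcal{I}|$.
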
To evaluate the robust score for each candidate split using
Equation \ref{eq:robust-splitting-criterion}, the values of
$\sum_{i \in \mathcal{I}_L^0} g_i$, $\sum_{i \in
\mathcal{I}_R^0} g_i$, $\sum_{i \in \Delta \mathcal{I}}
g_i$, $\sum_{i \in \mathcal{I}_L^0} h_i$, $\sum_{i \in
\mathcal{I}_R^0} h_i$, $\sum_{i \in \Delta \mathcal{I}}
h_i$, $g_{\text{min}}^{(\Delta\mathcal{I})}$,
$g_{\text{max}}^{(\Delta\mathcal{I})}$,
$h_{\text{min}}^{(\Delta\mathcal{I})}$ and
$h_{\text{max}}^{(\Delta\mathcal{I})}$ for each feature $j$
and threshold $\eta \in W_j$ need to be computed. All
thresholds are efficiently evaluated by considering a sorted
$W_j$ and maintaining running sums, minimums and maximums of
the first and second derivatives for the fixed and ambiguous
sets respectively. This enables the robust split score to be
computed in constant time for each candidate split. The
proposed algorithm iterates over a larger set of candidate
splits than the exact XGBoost algorithm (up to three
thresholds per data point rather than one). However, this
adds only a constant-factor overhead to the overall training
procedure, and the exploration of the additional
candidate splits leads to more robust trees in practice.


The sorting operation has a time complexity of
$\mathcal{O}(n\log n)$, and the evaluation of the robust
score for each candidate split has a time complexity of
$\mathcal{O}(1)$. Thus,  the  overall time complexity of
finding an optimal split over $m$ features is
$\mathcal{O}(mn\log n)$. This is the same as the time
complexity of the exact XGBoost tree building procedure. The
detailed algorithm for constructing robust trees is provided
in Algorithm~\ref{alg:robust-splitting}.

Once an optimal split is identified, we assign leaf weights
to the left and right child nodes based on the optimal
values of $p_\eta$ and $q_\eta$ of the threshold that
maximises the robust splitting criterion:
\begin{equation}
    \label{eq:leaf-weights-robust}
    \begin{gathered}
        w_L^*=-\frac{\sum_{i \in \mathcal{I}_L^0} g_i + p_\eta}{{\sum_{i \in \mathcal{I}_L^0} h_i + q_\eta} +\lambda} \\[1ex]
        w_R^*=-\frac{\sum_{i \in \mathcal{I}_R^0} g_i +
        \sum_{i \in \Delta\mathcal{I}} g_i -
    p_\eta}{{\sum_{i \in \mathcal{I}_R^0} h_i + \sum_{i \in
    \Delta\mathcal{I}} h_i - q_\eta} +\lambda}.
    \end{gathered}
\end{equation}
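
As a worked instance of Equation~\ref{eq:leaf-weights-robust} (all
values are hypothetical and chosen for illustration), suppose that
$\sum_{i \in \mathcal{I}_L^0} g_i = 1.0$,
$\sum_{i \in \mathcal{I}_L^0} h_i = 2.0$,
$\sum_{i \in \mathcal{I}_R^0} g_i = -0.8$,
$\sum_{i \in \mathcal{I}_R^0} h_i = 1.5$,
$\sum_{i \in \Delta\mathcal{I}} g_i = 0$,
$\sum_{i \in \Delta\mathcal{I}} h_i = 0.55$, $\lambda = 1$, and that
the optimisation returns $p_\eta = 0.4$ and $q_\eta = 0.35$. Then
\begin{equation*}
\begin{gathered}
    w_L^* = -\frac{1.0 + 0.4}{2.0 + 0.35 + 1} = -\frac{1.4}{3.35} \approx -0.418, \\[1ex]
    w_R^* = -\frac{-0.8 + 0 - 0.4}{1.5 + 0.55 - 0.35 + 1} = \frac{1.2}{2.7} \approx 0.444.
\end{gathered}
\end{equation*}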

Finally, as in conventional greedy tree building procedures
\citep{Mingers1989-pruning} and in the XGBoost algorithm, we
apply pruning to the tree once it is constructed. We start
with the leaf nodes and iteratively prune the tree by
removing nodes that have a robust splitting score below
$\gamma$. This post-training pruning step accounts for the
greediness of the building procedure which cannot
guarantee that the robust loss is decreasing at successive
splits. In contrast to early-stopping
methods~\citep{fprdt2022}, pruning after training
allows certain splits to be made that may not lead to a local
loss reduction, but may enable a greater loss reduction in
subsequent splits.

Thus, we propose a \textbf{greedy certified} training method in
which an upper bound of the robust loss function is minimised at
each split. In principle,
this should lead to significantly more robust trees than a
heuristic approach to estimate the robust loss function per
split.


\begin{algorithm}[h]
\SetAlgoLined\SetArgSty{}

\SetKwInput{KwSmallIn}{in}
\SetKwInput{KwUpdate}{update}
\KwIn{Training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, $\epsilon$, the radius of the $L_{\infty}$ ball}
\KwIn{Node indices $\mathcal{I}$, per-instance gradients $\{g_i\}_{i\in \mathcal{I}}$, Hessians $\{h_i\}_{i\in \mathcal{I}}$, regularization parameter $\lambda$, minimum gain $\gamma$}

\BlankLine

\BlankLine
\For{$j \gets 1$ \KwTo $m$}{
  

  \For{$i$ in sorted$(\mathcal{I}, \text{ ascending by } x_i^{(j)})$}{
    \For{$\eta \in \{x_i^{(j)} - \epsilon, x_i^{(j)} ,  x_i^{(j)} + \epsilon\} $}{
        $\mathcal{I}_L^{0} = \{(\mathbf{x_i}, y_i)|x_i^{(j)} \leq \eta - \epsilon\}$\;
        $\mathcal{I}_R^{0} = \{(\mathbf{x_i}, y_i)|x_i^{(j)} > \eta + \epsilon\}$\;
        $\Delta \mathcal{I} = \{(\mathbf{x_i}, y_i)|\eta - \epsilon < x_i^{(j)} \leq \eta + \epsilon\}$ \;
        
        Update sums $\sum_{i \in \mathcal{I}_L^0}g_i$, $\sum_{i \in \mathcal{I}_R^0} g_i$, $\sum_{i \in \Delta \mathcal{I}} g_i$, $\sum_{i \in \mathcal{I}_L^0} h_i$, $\sum_{i \in \mathcal{I}_R^0} h_i$, $\sum_{i \in \Delta \mathcal{I}} h_i$\;
        %  \;
        
        Update values $g_{\text{min}}^{(\Delta \mathcal{I})}$, $g_{\text{max}}^{(\Delta \mathcal{I})}$, $h_{\text{min}}^{(\Delta \mathcal{I})}$ and $h_{\text{max}}^{(\Delta \mathcal{I})}$\;

        $\mathcal{S}_{\text{rob}}^{\text{lb}}$ $\gets$ Lower bound of robust split score from Equation \ref{eq:robust-splitting-criterion}\;
        $p_\eta$, $q_\eta$ $\gets$ $\operatorname*{argmin}\mathcal{S}_{\text{rob}}^{\text{lb}}$\;
    }
  }
}
$j^*$, $\eta^*$, $p^*_\eta$, $q^*_\eta$ $\gets$ $\operatorname*{argmax}\mathcal{S}_{\text{rob}}^{\text{lb}}$\;
\BlankLine

\KwOut{Best split: feature $j^*$, threshold $\eta^*$, optimal values $p^*_\eta$, $q^*_\eta$}
\caption{Robust Splits for XGBoost Trees}
\label{alg:robust-splitting}
\end{algorithm}




