

\section{Experimental Evaluation}\label{sec:evaluation}

We evaluate the robust training procedure introduced in the previous
section by comparing it with state-of-the-art approaches on a variety of
datasets and with respect to several performance metrics.

{\bf Performance metrics.} Given a sample $(\mathbf{x}, y)$, let
$\hat{y}_\text{adv}$ denote the output of the model $F$ under a
worst-case adversarial perturbation on the input $\mathbf x$. For
regression tasks with a mean squared error loss, $\hat{y}_\text{adv}$
is equal to:
\begin{equation}
    \hat{y}_\text{adv} = F(\mathbf{x}')\quad \text{where}
    \quad\mathbf{x}' = \operatorname*{argmax}_{\mathbf{x}'
    \in \Delta_{\epsilon}(\mathbf{x})}\left|F(\mathbf{x}')
    - y\right|.
\end{equation}
On the basis of $\hat{y}_\text{adv}$ we define the robust $R^2$ score as:
\begin{equation}
    R^2_{\text{rob}} = 1 - \frac{\sum_{i=1}^n(y_i-\hat{y}_{\text{adv},i})^2}{\sum_{i=1}^n(y_i - \bar{y})^2},
\end{equation}
where $\bar{y}$ is the mean of the labels on the test dataset.
This score summarises a lower bound on the model's performance
over a test set under adversarial attacks of radius $\epsilon$.
Furthermore, given $\hat{y} =
F(\mathbf{x})$, let $\hat{y}_\text{dev}$ denote the output of the
model that maximises the deviation from the unperturbed prediction
$\hat{y}$, i.e.,
\begin{equation}
    \hat{y}_\text{dev} = F(\mathbf{x}')\quad \text{where}
    \quad\mathbf{x}' = \operatorname*{argmax}_{\mathbf{x}'
    \in \Delta_{\epsilon}(\mathbf{x})}\left|F(\mathbf{x}')
    - \hat{y}\right|.
\end{equation}
On the basis of $\hat{y}_\text{dev}$, we define the robust
mean absolute deviation (MAD) as:
\begin{equation}
    \text{MAD} = \frac{1}{n}\sum_{i=1}^n |\hat{y}_i-\hat{y}_{\text{dev},i}|.
\end{equation}
The MAD metric gives an upper bound on the average deviation
of the model's predictions under $L_\infty$ norm-bounded adversarial
attacks. In the following, we use the MILP-based method
originally introduced in \cite{Kantchelian} to
compute the outputs $\hat{y}_\text{adv}$ and $\hat{y}_\text{dev}$
for each test point. 
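
The two worst-case outputs are related. As a brief sketch, assuming that
$\mathbf{x} \in \Delta_{\epsilon}(\mathbf{x})$ and that
$\hat{y}_\text{adv}$ and $\hat{y}_\text{dev}$ are exact maximisers, the
triangle inequality gives, for every test point,
\begin{equation*}
    |\hat{y} - y| \;\leq\; |\hat{y}_\text{adv} - y| \;\leq\;
    |\hat{y} - y| + |\hat{y}_\text{dev} - \hat{y}|.
\end{equation*}
Under these assumptions, the left inequality implies
$R^2_{\text{rob}} \leq R^2$ on the same test set, while the right
inequality shows that a small deviation
$|\hat{y}_\text{dev} - \hat{y}|$ limits how far the adversarial error
can exceed the clean error.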

In addition to $R^2_{\text{rob}}$ and $\text{MAD}$, we introduce a 
robust accuracy metric, analogous to those used in the literature for 
classification tasks. 
We define robust accuracy as the proportion of test data points for
which the model's output remains within a specified threshold $\tau$
of the true label under adversarial perturbations, i.e., a point is
deemed robust if $|\hat{y}_\text{adv}-y| \leq \tau$.

To ensure consistency in our evaluation across the various
datasets, which have different label ranges, we set the threshold $\tau$
as a fraction $w_\tau$ of the range of output values in the test set,
i.e., $\tau = w_\tau \cdot (\max\{y\} - \min\{y\})$.

We refer to the robust accuracy metric as
$\text{acc}_{\text{rob}}^{(w_\tau)}$, where, for our experiments,
we consider $w_\tau \in \{0.2, 0.4\}$. The metric thus measures
the fraction of test points whose worst-case prediction error
stays within a specified fraction of the range of the output
values.
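
Written out, and assuming the same test-set range is used for every
point, the definitions above amount to the following, where
$\mathbf{1}[\cdot]$ denotes the indicator function (notation introduced
here for illustration only):
\begin{equation*}
    \text{acc}_{\text{rob}}^{(w_\tau)} = \frac{1}{n}\sum_{i=1}^n
    \mathbf{1}\!\left[\,|\hat{y}_{\text{adv},i} - y_i| \leq
    w_\tau \cdot \bigl(\max_j y_j - \min_j y_j\bigr)\right].
\end{equation*}
For instance, for a test set whose labels span a range of $100$,
$w_\tau = 0.2$ gives $\tau = 20$, so a point counts as robust if its
worst-case prediction lies within $20$ units of the true label.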

{\bf Baselines.}  We consider two baseline methods for
evaluating the performance of our robust training method:

\begin{itemize}
    
    \item \textbf{XGBoost} \citep{chen2016xgboost}: The
        conventional XGBoost method with the mean squared
        error objective.
        \item \textbf{Robust-GBDT}\footnote{Code taken from
            \href{https://github.com/chenhongge/RobustTrees}{https://github.com/chenhongge/RobustTrees}.}
            \citep{chen2019training}: The robust XGBoost
            training method that uses heuristics to estimate
            the worst-case robust loss at a split. To our
            knowledge, this is the only method in the
            literature that directly supports robust
            training for general loss functions.
\end{itemize}
 
\begin{table}[h]
    \caption{Average results over 19 regression datasets for
    the proposed robust-splitting criterion, Robust-GBDT,
and XGBoost under an $L_\infty$ adversarial attack for
various $\epsilon$ values.
$\frac{\text{MAD}}{y_{\text{range}}}$ refers to the mean
absolute deviation of the model normalised by the range of
the output values in the test set. }
    \label{tab:average_results_all_eps}
    \centering
    {\fontsize{8}{10}\selectfont
    \begin{tabular}{@{}ccccccc@{}}
    \toprule
    $\epsilon$                    & Method      & $R^2$   & $R^2_{\text{rob}}$ & $\frac{\text{MAD}}{y_{\text{range}}}$ & $\text{acc}_{\text{rob}}^{(0.2)}$ & $\text{acc}_{\text{rob}}^{(0.4)}$  \\ \midrule
    \multirow{3}{*}{0.005} & xgboost & 0.678 & 0.220   & 0.062          & 0.794         & 0.967                \\
                           & RGBDT   & \textbf{0.701} & 0.572   & 0.032          & 0.895         & 0.985                \\
                           & ours     & 0.690 & \textbf{0.636}   & \textbf{0.015}          & \textbf{0.910}         & \textbf{0.987}                \\ \midrule
    \multirow{3}{*}{0.01}  & xgboost & 0.678 & -0.186  & 0.093          & 0.730         & 0.929                \\
                           & RGBDT   & \textbf{0.697} & 0.421   & 0.050          & 0.855         & 0.976                \\
                           & ours     & 0.669 & \textbf{0.582}   & \textbf{0.023}          & \textbf{0.896}         & \textbf{0.986}                \\ \midrule
    \multirow{3}{*}{0.05}  & xgboost & \textbf{0.678} & -3.828  & 0.278          & 0.382         & 0.683                \\
                           & RGBDT   & 0.632 & -1.268  & 0.166          & 0.564         & 0.856                \\
                           & ours     & 0.521 & \textbf{0.189}   & \textbf{0.049}          & \textbf{0.813}         & \textbf{0.966}                \\ \midrule
    \multirow{3}{*}{0.1}   & xgboost & \textbf{0.678} & -7.191  & 0.396          & 0.190         & 0.505                \\
                           & RGBDT   & 0.565 & -4.440  & 0.255          & 0.371         & 0.719                \\
                           & ours     & 0.301 & \textbf{-0.049}  & \textbf{0.055}          & \textbf{0.742}         & \textbf{0.945}                \\ \bottomrule
\end{tabular}
}
\end{table}

{\bf Datasets and Hyperparameters.} 
We consider 19 regression datasets from a widely used tabular data
benchmark introduced in \cite{dt-better-than-nn}. This 
benchmark is chosen as it provides a diverse collection of 
real-world regression tasks, facilitating a comprehensive
evaluation of the robustness of the proposed method. As our
approach uses the $L_\infty$ attack model, which is only
applicable to continuous features, we consider only the datasets
in the benchmark whose features are all continuous.

We tune the hyperparameters of the three approaches to maximise the
conventional $R^2$ score on the validation set of each dataset, as this
approach closely mirrors the application of such models in practice. The
hyperparameters for the XGBoost baseline are obtained from the same benchmark, which presents the best grid-search
hyperparameters on each dataset. For the two robust
methods, we conduct a grid-search over the maximum tree depth, the
$L_2$ regularisation parameter $\lambda$, and the minimum loss reduction 
$\gamma$, using the hyperparameters from the XGBoost baseline as a starting point. We further limit the number of trees in the ensemble to 100 to mitigate the scalability issues associated with certifying large models with the MILP solver. 


We scale all feature values in the training data to $[0,1]$
to ensure uniformity in the perturbations across features.
We conduct experiments on the datasets with a range of
adversarial perturbation radii $\epsilon \in \{0.005, 0.01,
0.05, 0.1\}$. All results are averaged over 5-fold
cross-validation.
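
As an illustration of what the radius means after scaling, assume
min--max scaling, where $m_j$ and $M_j$ (symbols introduced here) denote
the minimum and maximum of feature $j$ on the training data with
$M_j > m_j$. The text above does not fix the exact scaler, so this is an
assumption:
\begin{equation*}
    \tilde{x}_j = \frac{x_j - m_j}{M_j - m_j}, \qquad
    \|\tilde{\mathbf{x}}' - \tilde{\mathbf{x}}\|_\infty \leq \epsilon
    \;\Longleftrightarrow\;
    |x'_j - x_j| \leq \epsilon\,(M_j - m_j) \quad \text{for all } j.
\end{equation*}
Under this assumption, $\epsilon = 0.05$ allows each feature to move by
up to $5\%$ of its training range.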

\begin{table*}[t]
    \caption{Comparisons of standard and robust regression metrics for the proposed robust-splitting criterion, Robust-GBDT, and XGBoost over 19 regression benchmark datasets for an $L_{\infty}$ adversarial attack of radius $\epsilon = 0.05$. 
    % The $y_{\text{range}}$ column denotes the range of the output values in the dataset. 
    The $\text{MAD}_{\text{ratio}}$ columns report the ratio of the MAD obtained by each baseline to the MAD obtained by our approach.}
    \label{tab:summary_results_eps_0_05}
    \centering
    {\fontsize{7.6}{9}\selectfont
    \begin{tabular}{@{}c|ccc|ccc|ccc|c|cc@{}}
    \toprule
    \multirow{2}{*}{Dataset} & \multicolumn{3}{c|}{ours} & \multicolumn{3}{c|}{RGBDT} & \multicolumn{3}{c|}{xgboost} & \multirow{2}{*}{$y_{\text{range}}$} & \multicolumn{2}{c}{$\text{MAD}_{\text{ratio}}$} \\
                                & $R^2$      & $R^2_{\text{rob}}$  & $\text{MAD}$ & $R^2$       & $R^2_{\text{rob}}$   & $\text{MAD}$   & $R^2$         & $R^2_{\text{rob}}$      & $\text{MAD}$      &                           & RGBDT             &        XGBoost       \\ \midrule
    Ailerons                  & 0.742    & \textbf{0.431}         & 0.000          & \textbf{0.765}        & 0.360             & 0.000              & 0.758                & 0.365                     & 0.000                      & 0.003    &     -                  &  -                    \\
Bike\_Sharing\_Demand     & 0.508    & \textbf{0.049}         & \textbf{31.434}         & 0.605        & -0.422            & 52.997             & \textbf{0.613}                & -0.328                    & 43.569                     & 410.200  & 1.686                 & 1.386                \\
Brazilian\_houses         & 0.745    & \textbf{0.626}         & \textbf{0.253}         & 0.838        & -0.100            & 0.768              & \textbf{0.986}                & -18.806                   & 3.507                      & 4.776    & 3.036                 & 13.862               \\
MiamiHousing2016          & 0.722    & \textbf{0.148}         & \textbf{0.369}          & 0.830        & -0.727            & 0.708              & \textbf{0.865}                & -0.991                    & 0.743                      & 3.533    & 1.919                 & 2.014                \\
abalone                   & 0.323    & \textbf{0.203}         & \textbf{0.622}          & \textbf{0.465}        & -0.280            & 1.962              & 0.364                & -0.815                    & 2.741                      & 22.400   & 3.154                 & 4.407                \\
cpu\_act                  & 0.979    & \textbf{0.933}        & \textbf{3.289}          & \textbf{0.984}        & 0.882             & 5.475              & 0.975                & -16.297                   & 78.840                     & 99.000   & 1.665                 & 23.971               \\
delays\_zurich\_transport & 0.055    & \textbf{0.009}         & \textbf{0.120}          & \textbf{0.108}        & -0.154            & 0.619              & 0.101                & -0.459                    & 1.060                      & 10.272   & 5.158                 & 8.833                \\
diamonds                  & 0.939    & \textbf{0.382}         & \textbf{0.120}          & 0.957        & -0.009            & 0.361              & \textbf{0.966}                & -2.341                    & 0.731                      & 2.264    & 3.008                 & 6.092                \\
elevators                 & 0.597    & \textbf{-0.090}        & \textbf{0.004}          & \textbf{0.823}        & -5.457            & 0.016              & 0.822                & -4.351                    & 0.015                      & 0.057    & 4.000                 & 3.750                \\
house\_16H                & 0.255    & \textbf{0.052}         & \textbf{0.165 }         & \textbf{0.320}        & -0.287            & 0.393              & 0.295                & -2.199                    & 1.037                      & 11.112   & 2.382                 & 6.285                \\
house\_sales              & 0.741    & \textbf{0.440}        & \textbf{0.193}          & 0.830        & -0.184            & 0.474              & \textbf{0.836}                & -0.478                    & 0.521                      & 3.745    & 2.456                 & 2.699                \\
houses                    & 0.750    & \textbf{0.438}         & \textbf{0.248}          & 0.788        & -0.297            & 0.502              & \textbf{0.853}                & -1.740                    & 0.794                      & 2.671    & 2.024                 & 3.202                \\
medical\_charges          & -0.004   & \textbf{-0.220}        & \textbf{0.111}         & 0.412        & -7.816            & 0.787              & \textbf{0.908}                & -5.596                    & 0.748                      & 1.852    & 7.090                 & 6.739                \\
nyc-taxi-green-dec-2016   & 0.003    & \textbf{-0.003}       & \textbf{0.005}          & 0.193        & -3.068            & 1.029              & \textbf{0.440}                & -7.020                    & 1.498                      & 3.270    & 205.800               & 299.600              \\
pol                       & 0.943    & \textbf{0.519}         & \textbf{16.551}        & 0.969        & 0.376             & 21.964             & \textbf{0.976}                & 0.152                     & 27.053                     & 100.000  & 1.327                 & 1.635                \\
sulfur                    & 0.780    & \textbf{-0.793}        & \textbf{0.035 }         & \textbf{0.918}        & -2.205            & 0.077              & 0.909                & -3.372                    & 0.090                      & 0.861    & 2.200                 & 2.571                \\
superconduct              & 0.573    & \textbf{0.402}         & \textbf{5.242}          & 0.757        & -1.855            & 40.668             & \textbf{0.784}                & -3.442                    & 54.040                     & 129.840  & 7.758                 & 10.309               \\
wine\_quality             & 0.245    & \textbf{0.063}         & \textbf{0.156}          & 0.381        & -1.305            & 0.830              & \textbf{0.390}                & -1.752                    & 0.927                      & 5.400    & 5.321                 & 5.942                \\
yprop\_4\_1               & 0.003    & \textbf{-0.004}        & \textbf{0.000}          & \textbf{0.062}        & -1.553            & 0.024              & 0.049                & -3.253                    & 0.041                      & 0.149    & inf                   & inf                   \\ \bottomrule
    \end{tabular}
    }
\end{table*}

The average performance of the proposed robust-splitting criterion,
Robust-GBDT, and XGBoost over the 19 regression datasets for various
$\epsilon$ values is reported in
Table~\ref{tab:average_results_all_eps}. We present per-dataset results
for a perturbation radius of $0.05$ in
Table~\ref{tab:summary_results_eps_0_05}. Results for the other
perturbation radii can be found in
Tables~\ref{tab:summary_results_eps_0_005},
\ref{tab:summary_results_eps_0_01} and
\ref{tab:summary_results_eps_0_1} in the appendix. The results
demonstrate the improved robustness of the proposed method compared to
the baselines, especially at larger perturbation radii. They also
highlight the fragility of conventional XGBoost models, which are
unable to maintain their performance under adversarial perturbations.
Indeed, even small input perturbations lead to large deviations in the
output values, severely degrading their predictive performance, as
indicated by the negative average $R^2_{\text{rob}}$ values from
$\epsilon = 0.01$ onwards. This underscores the need to consider
robustness in regression tasks.
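
To see what a negative score means, note that the definition of
$R^2_{\text{rob}}$ gives
\begin{equation*}
    R^2_{\text{rob}} < 0 \iff
    \sum_{i=1}^n (y_i - \hat{y}_{\text{adv},i})^2 >
    \sum_{i=1}^n (y_i - \bar{y})^2,
\end{equation*}
that is, under attack the model is worse than a constant predictor that
always outputs the mean label $\bar{y}$. More generally, the ratio of
the two sums equals $1 - R^2_{\text{rob}}$, so a score of $-1$
corresponds to a worst-case squared error of twice the total variation
of the labels.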


In contrast, the Robust-GBDT method is considerably more robust than
the conventional XGBoost model and performs well under small
perturbations. However, it is significantly less effective than the
proposed robust-splitting criterion at higher perturbation radii, where
it obtains large negative values of $R^2_{\text{rob}}$ in some cases.
The tables additionally show that the Robust-GBDT method obtains higher
standard $R^2$ scores than the proposed method. This can be intuitively
explained by the fact that Robust-GBDT uses a simple heuristic to
estimate the worst-case robust loss, while our method determines a
provable upper bound to the robust loss at each candidate split. The
more conservative bound ultimately makes the proposed approach more
robust at the expense of a drop in predictive performance.
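
As a worked reading of the $\text{MAD}_{\text{ratio}}$ columns in
Table~\ref{tab:summary_results_eps_0_05}, taking the ratio to be the
baseline's MAD divided by ours, the Brazilian\_houses row gives
\begin{equation*}
    \frac{\text{MAD}_{\text{RGBDT}}}{\text{MAD}_{\text{ours}}}
    = \frac{0.768}{0.253} \approx 3.04,
    \qquad
    \frac{\text{MAD}_{\text{XGBoost}}}{\text{MAD}_{\text{ours}}}
    = \frac{3.507}{0.253} \approx 13.86,
\end{equation*}
which agrees with the tabulated values of $3.036$ and $13.862$ up to
rounding. A ratio above $1$ indicates a smaller worst-case deviation for
our method. The ratio is undefined or infinite where our MAD rounds to
$0.000$, as in the Ailerons and yprop\_4\_1 rows.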


{\bf Comparison with exact robust loss.}
In addition to the experiments conducted on the tabular data benchmark,
we evaluate the empirical performance and the tightness of two
approximations, the proposed lower bound of the robust splitting
criterion and the heuristic from \cite{chen2019training}, against the
exact robust loss, obtained by exactly solving the mixed-integer
optimisation problem in Eq.~\ref{eq:robust-splitting-criterion-exact}.
The tightness comparison is shown in
Figure~\ref{fig:rob-split-datasets}. We observe that our method indeed
provides a principled lower bound to the exact solution, and a much
tighter approximation than the heuristic method.

\begin{figure}[!h]
    \centering
    \includegraphics[width=\linewidth]{figures/some_datasets}
    \caption{Comparison of robust splitting scores obtained by our method, the heuristic from \cite{chen2019training}, and an exact solution to the robust splitting score from Equation \ref{eq:robust-splitting-criterion-exact} across threshold values, with $\epsilon=0.1$. These scores are computed from the root node of the first tree of the ensemble, on the optimal feature for splitting.}
    \label{fig:rob-split-datasets}
\end{figure}

Additionally, we compare the robustness profile of the proposed method, the exact splitting score, and the heuristic method from \cite{chen2019training} by
computing and plotting the $R^2_{\text{rob}}$ of models under
varying perturbation radii $\epsilon$. As the exact robust loss
is extremely expensive to compute, we limit our
evaluation to five datasets from the tabular data benchmark. A subset
of the results is presented in Figure~\ref{fig:robustness-profile},
with the full results available in
Figure~\ref{fig:robustness-profile-full} in the appendix. We additionally
evaluate the heuristic method with a relaxed perturbation radius ($2\times$ the original radius) to assess its robustness-performance trade-off.
\begin{figure}[!h]
    \centering
    \includegraphics[width=\linewidth]{figures/exact_vs_lb_vs_heuristic_smaller.pdf}
    \caption{Comparison of the robustness profile of the models obtained 
    by our method, the heuristic from \cite{chen2019training}, 
    the same heuristic set with a relaxed perturbation radius, and an exact 
    solution to the robust splitting score from Equation~\ref{eq:robust-splitting-criterion-exact}.}
    \label{fig:robustness-profile}
\end{figure}

The empirical robustness results in Figure~\ref{fig:robustness-profile} and the tightness of the approximation shown in Figure~\ref{fig:rob-split-datasets} demonstrate that the proposed lower bound of the robust splitting score achieves performance closely aligned with the exact robust splitting score across diverse datasets and training radii. In some cases, the relaxation even leads to more robust and accurate models. Training with the
proposed lower bound therefore yields models that closely approximate
those obtained with the exact robust splitting score, in terms of both
performance and robustness, while being significantly more computationally efficient.

Furthermore, the proposed approach yields more robust models than the heuristic approach, even when the latter is trained with a relaxed radius. The two approaches exhibit very different robustness profiles: ours offers a more favourable trade-off between robustness and predictive performance, and the gap becomes more pronounced at higher perturbation radii.

Indeed, we observe a trade-off between robustness and predictive
performance across the three methods evaluated.
Nonetheless, we believe our proposed robust-splitting
criterion is particularly compelling for applications where
robustness is paramount. By delivering markedly improved
empirical robustness against adversarial perturbations, especially at
higher perturbation levels, our approach presents a
valuable alternative to both conventional XGBoost and
existing robust training methods.
