\section{Experiments}


% Dataset description
\paragraph{Datasets.}
We evaluate our GP-RFM-Laplace on a variety of regression tasks.
Specifically, we use two tabular regression benchmarks with datasets from UCI \citep{asuncion2007uci} and OpenML \citep{vanschoren2014openml}, respectively.
For the UCI benchmark, we use 7 datasets inspired by \citet{duan2020ngboost}, and
for the OpenML benchmark, we use the collection of 16 numerical regression datasets by \citet{grinsztajn2022tree}.



% Hyperparameter tuning and preprocessing
\paragraph{Hyperparameter tuning.}
We follow the protocol proposed in \citet{hernandez2015probabilistic} for data splitting and hyperparameter tuning.
For the UCI benchmark, we follow \citet{duan2020ngboost} to hold out 10\% of the data as a test set.
For the OpenML benchmark, we follow \citet{grinsztajn2022tree} to hold out 30\% of the data as a test set.
The remaining data is split into a 70\% training set and a 30\% validation set to tune the hyperparameters.
We use grid search over all combinations of hyperparameters and select the configuration with the best NLL on the validation set.
Details on the hyperparameter search space can be found in the appendix.
Finally, we train the model on the full training set and evaluate it on the test set.
The process is repeated for 20 random seeds and we report the mean and standard deviation of the results.
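
As a quick check of the resulting split sizes, the percentages stated above translate into the following fractions of each full dataset during tuning. This is only a worked calculation from the stated hold-out and train/validation proportions, not an additional experimental detail.
\begin{align*}
    \text{UCI:} \quad & 0.9 \cdot 0.7 = 0.63 \text{ (training)}, \quad 0.9 \cdot 0.3 = 0.27 \text{ (validation)}, \quad 0.10 \text{ (test)}, \\
    \text{OpenML:} \quad & 0.7 \cdot 0.7 = 0.49 \text{ (training)}, \quad 0.7 \cdot 0.3 = 0.21 \text{ (validation)}, \quad 0.30 \text{ (test)}.
\end{align*}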



% Baseline methods
\paragraph{Baselines.}
We compare our GP-RFM-Laplace and its diagonal version GP-RFM-Laplace-diag to a variety of probabilistic baseline methods. The details are described in \Cref{sec:app-implementation-details}.
For GPs, we consider the standard \emph{RBF} and \emph{Laplace} kernels.
As a neural-network-based GP, we consider \emph{deep kernel learning} \citep{wilson2016deep}.
% where we learn the length scale parameter $\ell$ and the noise variance $\sigma$.
Additionally, we compare our method to kernels with ARD, specifically the widely used \emph{ARD-RBF} kernel \citep{neal1996bayesian} and the \emph{ARD-Laplace} kernel.
% where we learn the length scale parameters $\ell_1,\dots,\ell_d$ and the noise variance $\sigma$.
The latter is rarely used in GPs but is a natural extension of the Laplace kernel that incorporates covariate weights learnt through MLE.
Finally, we use \emph{ARD-Laplace-full} as a direct counterpart to the RFM-Laplace, which also uses a full weighting matrix but learns it through MLE instead of AGOP \citep{vivarelli1998discovering}.

% boosting approaches
% NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm
Furthermore, we consider probabilistic extensions of boosting approaches, which are known to be powerful for predictive tasks.
Firstly, we use \emph{NGBoost} \citep{duan2020ngboost}, which learns the parameters of a Gaussian distribution through boosting with natural gradient updates.
Secondly, we use \emph{CatBoost-Ensemble} \citep{malinin2021uncertainty} for which we use an ensemble of 10 gradient boosting-based models. From the ensemble, the predictive distribution is obtained by computing statistics of the individual predictions.
% preprocessing
Following \citet{duan2020ngboost}, we standardize covariates and labels to have zero mean and unit variance for all GP-based methods but not for the boosting-based methods.



% Evaluation metrics
\paragraph{Evaluation metrics.}
% RMSE, NLL, 95\% Coverage Error, Interval Length at 95\% Coverage
We are interested in the predictive performance of the models as well as their uncertainty quantification.
Therefore, we evaluate the models on their \emph{root mean squared error (RMSE)} as well as their \emph{NLL} on the test set.
We also require the model uncertainty to be calibrated, i.e. the predictive distribution should reflect the likelihood of prediction errors.
To evaluate calibration, we compute the \emph{95\% coverage error (CE)}, which measures how far the empirical coverage, i.e. the proportion of data points for which the 95\% prediction interval contains the true value, deviates from the nominal level of 95\%.
For the model to be well-calibrated, the coverage should be 95\% and the corresponding CE should be zero.
Finally, we evaluate the \emph{interval length (IL)} of the 95\% prediction interval. This measure is important for models with similar CE since a smaller IL indicates a more precise uncertainty quantification.
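
For concreteness, the four metrics can be sketched as follows. This is an illustration under assumptions that are not stated in the text: the predictive distribution for test point $i$ is Gaussian with mean $\mu_i$ and standard deviation $\sigma_i$, the 95\% interval is the central interval $\mu_i \pm z\,\sigma_i$ with $z \approx 1.96$, the CE is the absolute deviation of the empirical coverage from $0.95$, and all metrics are averaged over the $m$ test points (the symbols $m$, $\mu_i$, $\sigma_i$ and $z$ are introduced here for illustration only).
\begin{align*}
    \mathrm{RMSE} &= \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i-\mu_i)^2}, \\
    \mathrm{NLL} &= \frac{1}{m}\sum_{i=1}^{m} \left[ \frac{1}{2}\log\bigl(2\pi\sigma_i^2\bigr) + \frac{(y_i-\mu_i)^2}{2\sigma_i^2} \right], \\
    \mathrm{CE} &= \left| \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\bigl[\, |y_i-\mu_i| \le z\,\sigma_i \bigr] - 0.95 \right|, \\
    \mathrm{IL} &= \frac{1}{m}\sum_{i=1}^{m} 2 z\,\sigma_i .
\end{align*}
Under these assumptions, a model can reach a CE of zero with very wide intervals, which is why the IL is reported alongside the CE.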
















\subsection{Main results}
Here we present the main results of our experiments.
We compare our GP-RFM-Laplace to all baseline methods on the UCI and OpenML benchmark datasets.
% normalization as datasets have different scales
Due to varying scales, we normalize metrics for comparison across datasets. We achieve this by calculating the minimum and maximum values for each dataset across all methods and seeds, followed by normalizing results to the range $[0,1]$.
% results reference to figures and tables
The results for each dataset of the OpenML benchmark in terms of NLL and RMSE are in \Cref{tab:main_table,tab:main_table_rmse}, respectively. Summary figures for NLL, RMSE and CE are shown in \Cref{fig:main-tabular-benchmark} using violin plots to indicate the distribution of the results including a boxplot for the median and quartiles.
The results for the UCI benchmark are shown in \Cref{fig:main-app} in \Cref{sec:app-results}.
Note that the results for IL are omitted from the summary figure as comparing IL across datasets is not meaningful.
% A heatmap for timing in seconds for a selection of methods is shown in \Cref{fig:main-tabular-benchmark-time}.
Detailed performance results for each method on all datasets individually can be found in \Cref{sec:details:tables}.
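
The min-max normalization used for the cross-dataset comparison can be written explicitly. As a sketch, let $s_{t,k,r}$ denote the value of a metric on dataset $t$ for method $k$ and seed $r$ (this notation is introduced here for illustration only); the normalized value is then
\begin{align*}
    \tilde{s}_{t,k,r} = \frac{s_{t,k,r} - \min_{k',r'} s_{t,k',r'}}{\max_{k',r'} s_{t,k',r'} - \min_{k',r'} s_{t,k',r'}} \in [0,1],
\end{align*}
where the minimum and maximum are taken over all methods and seeds on the same dataset. For metrics where lower is better, the best single run on a dataset is mapped to $0$ and the worst to $1$.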


\input{figures/main_table}
\input{figures/main_table_rmse}


% results summary
In terms of NLL, we observe that both GP-RFM-Laplace variants are outperformed only by the CatBoost-Ensemble.
However, the GP-RFM-Laplace is the best method in terms of RMSE, closely followed by the GP-ARD-Laplace.
Regarding calibration in terms of CE, we observe that the boosting methods are dominant, followed by the GP-RFM-Laplace.
Overall, both the GP-RFM-Laplace and the GP-ARD-Laplace perform similarly well across all metrics, making them competitive with boosting-based approaches for probabilistic regression.

% ARD-full comparison
Notably, the ARD-Laplace-full, which serves as a direct counterpart to the RFM-Laplace, performs significantly worse although both methods use full feature matrices $\mM$. Directly optimising $\mM$ through MLE in the ARD-Laplace-full is challenging because the number of entries of a full matrix grows quadratically with the covariate dimension, which is often high. Hence, while the parameterization of both methods is equal, the RFM-based learning procedure of alternately solving convex problems appears to be easier to optimise.


% % timing
% In terms of computation time, \Cref{fig:main-tabular-benchmark-time} includes training and testing time using official implementations of the respective methods. For details, see \Cref{sec:app-implementation-details}. We observe that over all datasets, the GP-based methods are comparable in timing. In contrast, the boosting-based methods, especially CatBoost-Ensembles are considerably slower. Results for all methods are in \Cref{tab:main_table_time_app}.
% \begin{figure}[htb!]
%     \centering
%     \includegraphics[width=0.9\columnwidth]{figures/heatmap_time_cropped.pdf}
%     \caption{Time for training and testing on OpenML benchmark. Rows correspond to rows in \Cref{tab:main_table}.}
%     \label{fig:main-tabular-benchmark-time}
% \end{figure}


\begin{figure}[htb!]
    \centering
    \begin{subfigure}[b]{0.475\textwidth}
        \centering
        \includegraphics[trim=0 48 0 0, clip, width=\textwidth]{figures/tabularbenchmark_nll.pdf}
    \end{subfigure}
    % \hfill
    \begin{subfigure}[b]{0.475\textwidth}
        \centering
        \includegraphics[trim=0 48 0 0, clip, width=\textwidth]{figures/tabularbenchmark_rmse.pdf}
    \end{subfigure}
    % \vskip\baselineskip
    \begin{subfigure}[b]{0.475\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/tabularbenchmark_ce.pdf}
    \end{subfigure}
    \caption{Violin plot results on the OpenML benchmark including boxplots with median and quartiles for each method.}
    \label{fig:main-tabular-benchmark}
\end{figure}















\subsection{Toy dataset}
Given the qualitatively similar performance of the GP-RFM-Laplace and its diagonal version, we investigate the differences between the two methods in more detail.
Mathematically, in the RFM-Laplace-diag we restrict the feature matrix $\mM$ to be diagonal.
Therefore, the RFM-Laplace-diag is a special case of the RFM-Laplace where the latter can additionally capture covariate correlations that are relevant for the predictive task.
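
To make this special-case relation concrete, the following sketch assumes a Laplace kernel with a Mahalanobis-type distance, as is commonly used for RFMs. The exact parameterization and the length scale $\ell$ are assumptions made here for illustration; the definition in the method section takes precedence. With $M_{jk}$ denoting the $(j,k)$ entry of $\mM$,
\begin{align*}
    k_{\mM}(\vx,\vx') &= \exp\Bigl(-\frac{1}{\ell}\sqrt{(\vx-\vx')^\top \mM\, (\vx-\vx')}\Bigr), \\
    (\vx-\vx')^\top \mM\, (\vx-\vx') &= \sum_{j=1}^{d} M_{jj}\,(\vx_{[j]}-\vx'_{[j]})^2 + \sum_{j\neq k} M_{jk}\,(\vx_{[j]}-\vx'_{[j]})(\vx_{[k]}-\vx'_{[k]}).
\end{align*}
Restricting $\mM$ to be diagonal removes the second sum, so only per-covariate weights remain, as in kernels with ARD. The off-diagonal terms are what allow the full matrix to weight directions that combine several covariates.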

To highlight the advantage of the RFM-Laplace, we create a toy dataset. The covariates $\vx$ are independent and the labels $y$ are a nonlinear function of the first 10 covariates
\begin{align}
    \vx\sim\gU(0_d,1_d); \quad y=\Bigl(\sum_{j=1}^{10} \vx_{[j]}\Bigr)^2.
    \label{eq:toy-data}
\end{align}

This dataset is crafted to challenge methods that struggle to determine the direction in which covariates are combined, i.e. the off-diagonal correlation of covariates. We compare the performance in terms of NLL for a range of covariate dimensions $d$ in \Cref{fig:toy-data-nll}. The RMSE results for all methods can be found in \Cref{fig:toy-data-app}.

We observe that the GP-RFM-Laplace outperforms all other methods for all covariate dimensions. This demonstrates that a non-diagonal metric, as used in the RFM-Laplace, can improve performance considerably over the diagonal metrics used in kernels with ARD, an option that has been underexplored in the community.
%
Furthermore, the results in \Cref{tab:main_table} and \Cref{fig:main-tabular-benchmark} show that no GP-based method outperforms the GP-RFM-Laplace. However, the diagonal kernel with ARD (GP-ARD-Laplace) performs similarly well to our GP-RFM-Laplace for many datasets. Therefore, we conjecture that in many real-world datasets, there is either little covariate correlation or the covariate correlation is not relevant for the predictive task.
For datasets where the GP-RFM-Laplace considerably outperforms the GP-ARD-Laplace, such as the `isolet' (Isolated Letter Speech Recognition) dataset from OpenML, we observe that there is indeed considerable covariate correlation.%\looseness=-1

\begin{figure}[t]
    \centering
    % \includegraphics{figuresTikz/toy_data_nll}
    % \tikzsetnextfilename{toy_data_nll}
    \input{tikz/toy_data_nll}
    \caption{Toy dataset with covariate correlation for prediction. We scale the number of train samples with $n=20 d$.}
    \label{fig:toy-data-nll}
\end{figure}











\subsection{Visualizing feature matrices}
% some introductory sentence
To get a better understanding of the learnt feature matrix~$\mM$, we visualize the normalized feature matrices for the RFM-Laplace and its diagonal version RFM-Laplace-diag in \Cref{fig:feature-matrix}.
% top: toy data
In the top row, we compare both methods on the toy dataset, where we generated the labels with correlating covariates according to \Cref{eq:toy-data}.
For this dataset, we can compute the true feature matrix $\mM$ from the Jacobian of the labels with respect to the covariates.
% true feature matrix with block of 10x10 1/n sum^n (sum^10 x_i)^2 but normalized to one
The true feature matrix is a block matrix whose top-left $10\times10$ block has all entries equal, up to normalization, to $\frac{1}{n}\sum_{i=1}^{n} \bigl(\sum_{j=1}^{10}\vx_{i[j]}\bigr)^2$, while the remaining entries are zero. Here, $\vx_{i[j]}$ denotes the $j$th dimension of the $i$th sample.
It is necessary to learn this non-zero block to capture the relevant covariate correlation.
As expected, \Cref{fig:feature-matrix} shows that the RFM-Laplace learns the relevant covariate correlation, as indicated by nonzero off-diagonal values of the feature matrix, while the diagonal method is unable to capture this relation.
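
The block structure of the true feature matrix can be derived in two steps. The sketch below assumes that the true feature matrix is the average outer product of the gradients of the label function from \Cref{eq:toy-data}, written here as $f(\vx)$ (a name introduced for illustration only):
\begin{align*}
    \frac{\partial f}{\partial \vx_{[k]}}(\vx) &=
    \begin{cases}
        2\sum_{j=1}^{10}\vx_{[j]}, & k\le 10,\\
        0, & k>10,
    \end{cases}\\
    \Bigl[\frac{1}{n}\sum_{i=1}^{n}\nabla f(\vx_i)\,\nabla f(\vx_i)^\top\Bigr]_{kl} &=
    \begin{cases}
        \frac{4}{n}\sum_{i=1}^{n}\bigl(\sum_{j=1}^{10}\vx_{i[j]}\bigr)^2, & k,l\le 10,\\
        0, & \text{otherwise}.
    \end{cases}
\end{align*}
The constant factor $4$ vanishes after normalization. Since every gradient points in the same direction, namely the indicator vector of the first 10 covariates, the nonzero block has rank one. For $\vx\sim\gU(0_d,1_d)$, the sum of ten independent uniform covariates has mean $5$ and variance $10/12$, so for large $n$ the unnormalized block entries concentrate around $4\,(25+10/12)\approx 103.3$.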

% bottom: kin8nm
In the bottom row, we compare both methods on the Kin8nm dataset from the UCI benchmark.
In this real-world dataset, the RFM-Laplace captures the non-zero covariate correlation and focuses on a low-dimensional set of covariates. This ability of RFMs to learn low-dimensional features has been proven for linear RFMs in \citet{radhakrishnan2024linear}.
Additionally, we can qualitatively see that the RFM-Laplace learns the same diagonal covariate re-weighting as the RFM-Laplace-diag. % and is, therefore, a direct generalization.
Therefore, the RFM-Laplace is a direct generalisation of the RFM-Laplace-diag and can learn more complex features, which allows for both of these datasets to be predicted more accurately.

\begin{figure}[htb]
    \centering
    % \includegraphics{figuresTikz/plot_feature_matrix_toydata_rfm}
    % \tikzsetnextfilename{plot_feature_matrix_toydata_rfm}
    \input{tikz/plot_feature_matrix_toydata_rfm}
    % \includegraphics{figuresTikz/plot_feature_matrix_toydata_rfm_diag}
    % \tikzsetnextfilename{plot_feature_matrix_toydata_rfm_diag}
    \input{tikz/plot_feature_matrix_toydata_rfm_diag}
    % \input{tikz/plot_feature_matrix_toydata}


    % \includegraphics{figuresTikz/plot_feature_matrix_uci_rfm}
    % \tikzsetnextfilename{plot_feature_matrix_uci_rfm}
    \input{tikz/plot_feature_matrix_uci_rfm}
    % \includegraphics{figuresTikz/plot_feature_matrix_uci_rfm_diag}
    % \tikzsetnextfilename{plot_feature_matrix_uci_rfm_diag}
    \input{tikz/plot_feature_matrix_uci_rfm_diag}
    \caption{Normalized feature matrices for toy data (top) and Kin8nm dataset from UCI benchmark (bottom).}
    \label{fig:feature-matrix}
\end{figure}







\subsection{Out-of-distribution data}
% motivation for OOD data
Having established that the GP-RFM-Laplace is a competitive method for probabilistic regression, we now investigate its performance on out-of-distribution (OOD) data.
Distribution shift is a common scenario in real-world applications where, for example, the test data distribution changes over time.
One hope of utilising a probabilistic model is to obtain more reliable predictions by indicating when the model is uncertain about its predictions.
Understanding how well the GP-RFM-Laplace performs in such scenarios is essential for assessing its robustness and applicability in real-world settings.

% definition of OOD in our setting
In our setting, we concentrate on real-world data shifts.
Here, we focus on label shifts, i.e. the marginal distribution of the labels $p(y)$ changes,
while in \Cref{sec:app-distribution-shift} we also consider covariate shifts, i.e. the marginal distribution of the covariates $p(\vx)$ changes.
Specifically, we take four house datasets from the OpenML benchmark for which the labels describe the house value and which include covariates for latitude and longitude.
We define the ID data as the samples with $y>a$, where the threshold $a$ is chosen such that $p(y>a)=0.7$, and the OOD data as the remaining $30\%$ of the samples with $y<a$.
We then split the OOD data into four consecutive non-overlapping datasets, where each OOD dataset contains $7.5\%$ of the data.
This results in one ID dataset and four OOD datasets (denoted with OOD-1 to OOD-4) with increasingly severe label shifts.

% results and interpretation
\Cref{fig:ood-main_paper} shows the results on ID and OOD data for different methods.
We notice in the top figure that the NLL of the boosting-based methods rises with increasing severity of label shift while that of the GP-based methods improves. Overall, the GP-based methods including the GP-RFM-Laplace are the most robust.
This reliability is confirmed by the lower CE of the GP-based methods, which shows that these models are better calibrated under label shift, see \Cref{fig:ood-main_paper} (bottom). Note, however, that for large distribution shifts, none of the methods remains calibrated.
Generally, our results indicate that boosting-based methods are less robust to label shifts as defined in our scenario.


\begin{figure}[t]
    \centering
    % \includegraphics{figuresTikz/ood_nll_main}
    % \tikzsetnextfilename{ood_nll_main}
    \input{tikz/ood_nll_main}
    % \includegraphics{figuresTikz/ood_ce_main}
    % \tikzsetnextfilename{ood_ce_main}
    \input{tikz/ood_ce_main}
    \caption{
        Out-of-distribution experiment: NLL (top) and CE (bottom) on four house datasets with label shift. We show mean and standard deviation.
        }
    \label{fig:ood-main_paper}
\end{figure}







