\section{Experiments}
\label{sec:experiments}

%%TODO!: Number of calibration points?
%% SPECIFY TREE PARAMETERS AND LGBM SURROGATE
%% ADD RTREE WITH PINBALL LOSS
%% KMEANS PARTITION 

% We evaluate the proposed method on both real and synthetic datasets. 
We evaluate the proposed method on a variety of datasets. We show that the $\textsc{mcr}$-score-based method identifies a set of groups whose local coverage is close to the desired target, and that this reduces the under- and over-coverage gaps compared to the alternatives.

\subsection{Regression Dataset Results}

We use gradient boosting ($\textsc{lgbm}$) \cite{ke2017lightgbm} as our base regressor $f$; the hyperparameters for each dataset are selected with Optuna \cite{akiba2019optuna} to minimize the validation loss. Additional results using Lasso are shown in Appendix \ref{sec:appendix_results}. For all experiments, we split the available data as follows: 40\% train, 40\% calibration, 20\% test. We use a target coverage/validity of 0.9 (90\%, $\alpha = 0.1$).
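
To make the split concrete, the fractions above imply the following approximate set sizes for the smallest and largest datasets in Table \ref{tab:summary_methods}. The Housing counts are rounded, and the exact values depend on how the splitting routine rounds, so they should be read as indicative only:
\[
\begin{array}{lrrr}
 & \mathrm{train} & \mathrm{calibration} & \mathrm{test} \\
\mathrm{Housing}\ (n = 506) & \approx 202 & \approx 202 & \approx 101 \\
\mathrm{Protein}\ (n = 45730) & 18292 & 18292 & 9146
\end{array}
\]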

\begin{table}[h!]
\centering
\footnotesize
\scalebox{0.7}{
\begin{tabular}{l|r|rrr|c}
% \footnotesize
\toprule
 &\multicolumn{1}{c}{$\textsc{mcr}$ } &  \multicolumn{3}{c}{coverage} & \multicolumn{1}{c}{num } \\
model &  & average &   max group &   min group  &     groups  \\
                                              &          &       &       &         \\
\midrule
\multicolumn{6}{l}{Housing: nsamples = 506, nfeatures = 13 | $\textsc{lgbm}$-Regressor R2 = 0.64 $\pm$  0.03}\\
\midrule

$\textsc{lcp-rf-g}$ &  1.45$\pm$ 1.14 &   .8$\pm$ .04 &  .91$\pm$ .07 &  .64$\pm$ .15 &  3.6$\pm$ .55 \\
$\textsc{rf-g}$ &   .77$\pm$ .6 &  .93$\pm$ .03 &  .99$\pm$ .01 &  .86$\pm$ .06 &  3.6$\pm$ .55 \\
$\textsc{pb-kmeans}$  &   .81$\pm$ .3 &  .92$\pm$ .02 &  .97$\pm$ .04 &  .68$\pm$ .33 &  8.4$\pm$ 8.65 \\
$\textsc{mcr-kmeans}$ &     .75$\pm$ .12  &  \textbf{.91$\pm$ .05} &  \textbf{.95$\pm$ .05} &  .84$\pm$ .13 &    2.2$\pm$ 1.64\\
$\textsc{pb\_dtree}$  &  .68$\pm$ .31 &  .89$\pm$ .02 &  .94$\pm$ .03 &  .83$\pm$ .04 &  3.4$\pm$ .55 \\
$\textsc{mcr\_dtree}$  & \textbf{.65$\pm$ .17} &  .92$\pm$ .03 &  \textbf{.95$\pm$ .04} &  \textbf{.88$\pm$ .07} &   2.2$\pm$ 1.3 \\

\midrule
\multicolumn{6}{l}{Concrete: nsamples = 1030, nfeatures = 8 | $\textsc{lgbm}$-Regressor R2 = 0.82 $\pm$  0.026}\\
\midrule


$\textsc{lcp-rf-g}$ &  1.84$\pm$ 1.66 &  .83$\pm$ .01 &  .94$\pm$ .05 &  .69$\pm$ .11 &  4.6$\pm$ .55 \\
$\textsc{rf-g}$ &  .82$\pm$ .68 &   \textbf{.9$\pm$ .05} &  .97$\pm$ .02 &  .81$\pm$ .11 &  4.6$\pm$ .55 \\
$\textsc{pb-kmeans}$  & .66$\pm$ .48 &  .91$\pm$ .05 &  .97$\pm$ .05 &  .83$\pm$ .07 &  7.0$\pm$ 3.24 \\
$\textsc{mcr-kmeans}$ &  .88$\pm$ .27 &  .91$\pm$ .05 &  \textbf{.92$\pm$ .06} &  \textbf{.88$\pm$ .05} &  4.2$\pm$ 7.16 \\
$\textsc{pb\_dtree}$  &  .94$\pm$ .57 &  .89$\pm$ .04 &  .98$\pm$ .02 &  .77$\pm$ .07 &  6.6$\pm$ .55 \\
$\textsc{mcr\_dtree}$  &   \textbf{.55$\pm$ .72} &   \textbf{.9$\pm$ .04} &  \textbf{.92$\pm$ .06} &  \textbf{.88$\pm$ .04} &  2.4$\pm$ 2.61 \\

\midrule
\multicolumn{6}{l}{Energy: nsamples = 768, nfeatures = 8 | $\textsc{lgbm}$-Regressor R2 = 0.93 $\pm$  0.05}\\
\midrule

$\textsc{lcp-rf-g}$ &.99$\pm$ 1.31 &  .87$\pm$ .06 &  .97$\pm$ .03 &  .65$\pm$ .05 &    5.0$\pm$ 1.0 \\
$\textsc{rf-g}$ &.65$\pm$ .1 &  \textbf{.92$\pm$ .03} &  .99$\pm$ .02 &  .87$\pm$ .06 &   4.8$\pm$ 1.64  \\
$\textsc{pb-kmeans}$  &  1.04$\pm$ .34 &  .85$\pm$ .07 &    1.0$\pm$ .0 &  .07$\pm$ .15 &  47.8$\pm$ 1.79 \\
$\textsc{mcr-kmeans}$ &   .68$\pm$ .3 &  .94$\pm$ .03 &  \textbf{.96$\pm$ .05} &  .78$\pm$ .17 &   1.6$\pm$ 9.5 \\
$\textsc{pb\_dtree}$  &   .63$\pm$ .5 &  .93$\pm$ .03 &  .97$\pm$ .02 &  .87$\pm$ .07 &   3.6$\pm$ 1.52 \\
$\textsc{mcr\_dtree}$  &  \textbf{.5$\pm$ .46} &  \textbf{.92$\pm$ .03} &  \textbf{.96$\pm$ .03} &  \textbf{.88$\pm$ .07} &   3.2$\pm$ 1.64 \\

\midrule
\multicolumn{6}{l}{Power: nsamples = 9568, nfeatures = 4 | $\textsc{lgbm}$-Regressor R2 = 0.95 $\pm$  0.01}\\
\midrule
$\textsc{lcp-rf-g}$ & 3.67$\pm$ 2.26 &  .82$\pm$ .05 &  .86$\pm$ .03 &  .78$\pm$ .07 &    4.4$\pm$ 1.95\\
$\textsc{rf-g}$ & \textbf{.47$\pm$ .22} &    \textbf{.9$\pm$ .0} &  \textbf{.92$\pm$ .01} &  \textbf{.88$\pm$ .01} &    5.0$\pm$ .71 \\
$\textsc{pb-kmeans}$  &  .76$\pm$ .18 &  \textbf{.9$\pm$ .01} &  .95$\pm$ .03 &  .85$\pm$ .02 &  15.0$\pm$ 7.55 \\
$\textsc{mcr-kmeans}$ &   .66$\pm$ .23 &  .91$\pm$ .01 &  .96$\pm$ .03 &  .86$\pm$ .02 &  16.6$\pm$ 10.26 \\
$\textsc{pb\_dtree}$  &   1.13$\pm$ .6 &   \textbf{.9$\pm$ .0} &  .98$\pm$ .04 &  .76$\pm$ .09 &  17.2$\pm$ 9.26 \\
$\textsc{mcr\_dtree}$  &  .57$\pm$ .2 &  \textbf{.9$\pm$ .01} &  \textbf{.92$\pm$ .03} &  \textbf{.88$\pm$ .03} &   5.8$\pm$ 8.56 \\


\midrule
\multicolumn{6}{l}{Protein: nsamples = 45730, nfeatures = 9 | $\textsc{lgbm}$-Regressor R2 = 0.46 $\pm$  0.04}\\
\midrule

$\textsc{lcp-rf-g}$ &.83$\pm$ .56 &  \textbf{.9$\pm$ .01} &  .94$\pm$ .05 &  .85$\pm$ .02 &   10.5$\pm$ 6.4 \\
$\textsc{rf-g}$ & .61$\pm$ .36 &  \textbf{.9$\pm$ .0} &  .95$\pm$ .05 &  .88$\pm$ .03 &   11.0$\pm$ 7.0 \\
$\textsc{pb-kmeans}$  &  .59$\pm$ .57 &  \textbf{.9$\pm$ .0} &    1.0$\pm$ .0 &  .71$\pm$ .22 &   4.8$\pm$ 5.67 \\
$\textsc{mcr-kmeans}$ &   .47$\pm$ .3 &  \textbf{.9$\pm$ .0} &  .97$\pm$ .05 &  .87$\pm$ .03 &   11.4$\pm$ 8.26 \\
$\textsc{pb\_dtree}$  &  .79$\pm$ .27 &  \textbf{.9$\pm$ .0} &    1.0$\pm$ .0 &  .81$\pm$ .01 &   31.2$\pm$ .45 \\
$\textsc{mcr\_dtree}$  &  \textbf{.17$\pm$ .14} &  \textbf{.9$\pm$ .0} &  \textbf{.91$\pm$ .01} &  \textbf{.89$\pm$ .01} &    4.4$\pm$ .89 \\




\midrule
\multicolumn{6}{l}{kin8mn: nsamples = 8192, nfeatures = 8 | $\textsc{lgbm}$-Regressor R2 = 0.62 $\pm$  0.03}\\
\midrule

$\textsc{lcp-rf-g}$ &   2.32$\pm$ 1.1 &  .8$\pm$ .02 &  .84$\pm$ .02 &  .75$\pm$ .04 &   4.6$\pm$ 1.34 \\
$\textsc{rf-g}$ &  \textbf{.32$\pm$ .18} &   \textbf{.9$\pm$ .0} &  .93$\pm$ .01 &  .87$\pm$ .01 &   5.2$\pm$ .45 \\
$\textsc{pb-kmeans}$  &  .96$\pm$ .67 &  .92$\pm$ .0 &    1.0$\pm$ .0 &  .72$\pm$ .03 &  41.0$\pm$ 8.57 \\
$\textsc{mcr-kmeans}$ & .76$\pm$ .16    &  .91$\pm$ .02 &  .94$\pm$ .05 &  .82$\pm$ .11 &  20.6$\pm$ 7.06 \\
$\textsc{pb\_dtree}$  &    .73$\pm$ .39 & \textbf{.9$\pm$ .01} &  .97$\pm$ .03 &   .8$\pm$ .07 &  16.4$\pm$ 6.58 \\
$\textsc{mcr\_dtree}$  &    .4$\pm$ .2 &  \textbf{.9$\pm$ .01} &  \textbf{.91$\pm$ .02} &  \textbf{.89$\pm$ .02} &   3.0$\pm$ 1.41 \\
\bottomrule
\end{tabular}
}
\caption{Comparison of the group discovery partition methods. We report the $\textsc{mcr}$ and the average (marginal), maximum, and minimum group coverage on the identified partition, together with the number of groups per approach. Standard deviations are computed across 5 data splits. The proposed $\textsc{mcr\_dtree}$ attains the lowest $\textsc{mcr}$ on four of the six datasets and the second lowest on Power and kin8mn (behind $\textsc{rf-g}$), with values consistently below $1$, indicating that the discovered groups improve worst-group under-coverage w.r.t. single-threshold SCP. Every dataset uses an $\textsc{lgbm}$ regressor as the base model. We highlight the lowest $\textsc{mcr}$ and the smallest average coverage above the objective (0.9), since models with larger coverage are less efficient. For methods that achieve the marginal coverage objective, we highlight the max and min group coverage closest to the 0.9 objective.}
\label{tab:summary_methods}
% \vspace{-0.1in}
\end{table}

\begin{figure}[h!]
\centering

% \subfloat[Housing]{
% \includegraphics[width=0.45\columnwidth]{UAI2024/figures/housing_jp.png}
% \label{fig:housing_jp}} 
% \subfloat[Concrete]{
% \includegraphics[width=0.45\columnwidth]{UAI2024/figures/concrete_jp.png}
% \label{fig:concrete_jp}}

\subfloat[Energy]{
\includegraphics[width=0.45\columnwidth]{UAI2024/figures/energy_jp.png}
\label{fig:energy_jp}} 
\subfloat[Power]{
\includegraphics[width=0.45\columnwidth]{UAI2024/figures/power_JP.png}
\label{fig:power_JP}} 

\subfloat[Kin8mn]{
\includegraphics[width=0.45\columnwidth]{UAI2024/figures/kin8mn_jp.png}
\label{fig:kin8mn_jp}} 
\subfloat[Protein]{
\includegraphics[width=0.45\columnwidth]{UAI2024/figures/protein_jp.png}
\label{fig:protein_jp}} 

\begin{center}
\caption{Scatter and distribution plots of the prediction interval width (x-axis) versus coverage (y-axis) for the groups discovered by the proposed $\textsc{mcr\_dtree}$ and by $\textsc{pb\_dtree}$ on four of the six datasets. We plot all the groups obtained across the 5-fold realizations; point size represents group size. The target coverage is 0.9. We observe that $\textsc{mcr\_dtree}$ tends to identify a smaller number of groups of varying sizes, with group-conditional coverages concentrated around the 0.9 objective. Moreover, the identified groups are diverse in their interval widths. $\textsc{pb\_dtree}$ detects a significantly larger number of (smaller) groups, with a larger variance in group-conditional coverage. Additional plots are given in Appendix \ref{sec:appendix_results}.}
\label{fig:joint_plots}
\end{center}
% \vspace{-0.2in}
\end{figure}
% \paragraph{Datasets.} We considered seven regression tasks based on datasets from the UCI repository \cite{asuncion2007uci}. These are the Boston Housing price prediction (14 attributes, Housing) \cite{harrison1978hedonic};  Wine quality prediction (11 attributes, Wine) \cite{misc_wine_quality_186}; Energy efficiency prediction(12 building parameters, Energy) \cite{misc_energy_efficiency_242}; Concrete compressive strength prediction (8 attributes, Concrete) \cite{misc_concrete_compressive_strength_165}; Estimation of the size of the residue based on different physical and chemical properties of protein tertiary structure (Protein) \cite{misc_physicochemical_properties_of_protein_tertiary_structure_265} ;  Predict the Turbine decay state coefficient of a Gas Turbine propulsion plant (16 features, Naval) \cite{misc_condition_based_maintenance_of_naval_propulsion_plants_316}; Net hourly electrical energy output prediction of a combined cycle power plant (4 features, Power) \cite{misc_combined_cycle_power_plant_294}; Predict the distance of the end-effector from a target based on the forward kinematics of a robot arm (kin8mn) \cite{rasmussen1996delve,corke1996robotics}.
% \vspace{-0.5in}
\paragraph{Datasets.} We consider six regression tasks based on datasets from the UCI repository \cite{asuncion2007uci}: Boston Housing price prediction (14 attributes, Housing) \cite{harrison1978hedonic}; energy efficiency prediction (12 building parameters, Energy) \cite{misc_energy_efficiency_242}; concrete compressive strength prediction (8 attributes, Concrete) \cite{misc_concrete_compressive_strength_165}; estimation of the size of the residue based on different physical and chemical properties of protein tertiary structure (Protein) \cite{misc_physicochemical_properties_of_protein_tertiary_structure_265}; prediction of the net hourly electrical energy output of a combined cycle power plant (4 features, Power) \cite{misc_combined_cycle_power_plant_294}; and prediction of the distance of the end-effector from a target based on the forward kinematics of a robot arm (kin8mn) \cite{rasmussen1996delve,corke1996robotics}.

\paragraph{Methods.} We evaluate the performance of Algorithm \ref{alg:meta_algo}, choosing $\tau$ to be a decision tree that minimizes the pinball loss as described in Section \ref{subsec:Learning DTrees}. We use standard group-conditional split conformal prediction ($\mathcal{A}_{CP}$) \cite{vovk2012conditional} and denote the final model as $\textsc{mcr\_dtree}$. For the $\textsc{mcr}$ score (Eq. \ref{eq:miscoverage_ratio}) we select $d(1-\alpha,p) = (1-\alpha - p)_{+}$ as our under-coverage distance function. We constrain our decision trees to a minimum of 50 samples per leaf and a maximum depth of 5. We use the cost-complexity pruning parameter as the regularization parameter $\theta$, with $\theta_0 = 10^{-5}$ and $\Delta_{\theta_{t}} = 9\times\theta_{t}$. We compare against a decision tree that minimizes the average pinball loss (i.e., Algorithm \ref{alg:meta_algo} where $\textsc{mcr}$ is replaced by the average pinball loss), which we denote as $\textsc{pb\_dtree}$. Additionally, we compare against the group-wise random forest localizer conformalization method ($\textsc{lcp-rf-g}$) proposed by \cite{amoukou2023adaptive}, which generates a partition using conformity score weights extracted from a random forest; we also apply a standard split conformal approach to their identified groups ($\textsc{rf-g}$). Finally, we examine a simple K-means clustering in the input space, where the number of clusters is chosen by cross-validation based on the best average pinball loss ($\textsc{pb-kmeans}$) or the best $\textsc{mcr}$ ($\textsc{mcr-kmeans}$).
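
As a worked illustration of the under-coverage distance, take $\alpha = 0.1$ and two hypothetical groups (the coverage values $0.84$ and $0.95$ are chosen for illustration only and are not taken from our experiments):
\[
d(0.9, 0.84) = (0.9 - 0.84)_{+} = 0.06, \qquad d(0.9, 0.95) = (0.9 - 0.95)_{+} = 0,
\]
so only the under-covered group contributes, while over-coverage is not penalized by $d$. Similarly, if the regularization update is additive, i.e., $\theta_{t+1} = \theta_{t} + \Delta_{\theta_{t}}$ (an assumption made here for illustration), the stated increment corresponds to a tenfold geometric schedule:
\[
\theta_{t+1} = \theta_{t} + 9\,\theta_{t} = 10\,\theta_{t}, \qquad \theta_0 = 10^{-5},\quad \theta_1 = 10^{-4},\quad \theta_2 = 10^{-3},\ \dots
\]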

\paragraph{Coverage on Identified Groups.} Table \ref{tab:summary_methods} shows the minimum and maximum group coverage for the partitions recovered by each approach. The proposed $\textsc{mcr\_dtree}$ identifies partitions that consistently provide the best (or second best) minimum coverage and the smallest (or tied) gap between maximum and minimum group coverage, all while achieving the target marginal coverage of 0.9. In general, $\textsc{mcr\_dtree}$ tends to identify a smaller set of groups, with a wide range of interval widths, as shown in Figure \ref{fig:joint_plots}. Moreover, it achieves the smallest $\textsc{mcr}$ on four of the six datasets and the second smallest on Power and kin8mn, where only $\textsc{rf-g}$ is lower. The $\textsc{mcr}$ of $\textsc{mcr\_dtree}$ is consistently below $1$, indicating that a baseline SCP approach would yield worse results in terms of worst-group under-coverage. We note that the partition identified by $\textsc{rf-g}$, once integrated with split conformal prediction, performs substantially better than the $\textsc{lcp-rf-g}$ alternative. $\textsc{rf-g}$ achieves comparable results on some of the datasets, but shows a larger coverage gap between the identified groups and a worse $\textsc{mcr}$ on the remaining ones. $\textsc{pb-kmeans}$ and $\textsc{mcr-kmeans}$ have large variances in their performance, potentially because $\textsc{kmeans}$ clusters do not leverage the non-conformity scores.
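
As a concrete reading of Table \ref{tab:summary_methods}, consider the mean maximum and minimum group coverages reported for Protein. The resulting coverage gaps are
\[
\underbrace{0.91 - 0.89}_{\textsc{mcr\_dtree}} = 0.02
\qquad \mbox{versus} \qquad
\underbrace{1.0 - 0.81}_{\textsc{pb\_dtree}} = 0.19 .
\]
These are differences of means across the 5 data splits rather than per-split gaps, so they should be read as indicative.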

\paragraph{Size and Efficiency of the Identified Groups.} Figure \ref{fig:joint_plots} shows the joint distribution of the mean width and coverage of the groups identified by the $\textsc{mcr\_dtree}$ and $\textsc{pb\_dtree}$ approaches. We observe that $\textsc{mcr\_dtree}$ tends to identify a smaller number of groups than $\textsc{pb\_dtree}$, which tends to identify many small groups with a wide range of widths and coverages. $\textsc{mcr\_dtree}$ identifies groups with diverse widths (as seen in the marginal distribution of the mean width), but their coverages are closer to the desired objective of 0.9.

\paragraph{Interpretable Groups.} Figure \ref{fig:tree_graphs} in Appendix \ref{sec:appendix_results} shows the trees discovered by $\textsc{mcr\_dtree}$. The discovered groups have different interval widths, indicating that the uncertainty in the model's predictions is non-uniform across the input space. Moreover, groups with higher uncertainty (larger mean width) tend to be smaller. This can inform a data collection process by encouraging the collection of samples from the identified high-uncertainty minority groups.
