
\section{Region Identification Based on Non-Conformity Score Quantiles}
\label{sec:region_identification}
Given a non-conformity score, we want to discover regions in the input space that maximize the intra-group homogeneity of the score distribution while still differing significantly between groups. These regions, if interpretable, provide useful insights into the uncertainty of a model's predictions. Moreover, they can be leveraged at different steps of the ML life cycle, such as data filtering and collection.

Given a mis-coverage objective $\alpha$, we want to learn a mapping $\tau: \mathcal{X} \rightarrow \mathcal{G} \times \mathbb{R}$\footnote{We can also consider soft clustering, such that $\tau: \mathcal{X} \rightarrow \Delta^{|\mathcal{G}|-1} \times \mathbb{R}$.} that assigns each input to one of a computationally identifiable set of groups and outputs an estimate of the $1-\alpha$ quantile of the non-conformity score for that group, $\tau(X) = (g_{\tau}(X),q_{\tau}(X))$. We use $g_{\tau}(X)$ to denote the group label and $q_{\tau}(X)$ to denote the corresponding quantile estimate (i.e., score threshold). We consider $\tau$ to belong to a family of piece-wise constant models $\mathcal{T}$ such that $\forall \tau \in \mathcal{T}, \forall x_1,x_2 \in \mathcal{X} : g_{\tau}(x_1) = g_{\tau}(x_2) \Rightarrow q_{\tau}(x_1) = q_{\tau}(x_2)$.

Piece-wise constant models provide an interpretable characterization of the identified groups based on the input features. This is especially true for models such as trees, where the decision rules used to identify each group (leaf node) are clearly laid out. Note that our approach could also be applied to an interpretable feature space of the input by choosing $\tau(\phi(X))$, where $\phi(\cdot)$ is a mapping into that feature space. In particular, $\phi(X) = (X,f(X))$ makes the partitioning depend directly on the output of $f$. This allows the implicit identification of different uncertainty regions based on the model's prediction.
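
As an illustration, consider a hypothetical depth-one tree on a single feature $x_1$; the split point and leaf values below are purely illustrative numbers. Such a tree defines the piece-wise constant map
\[
\tau(x) = \left\{
\begin{array}{ll}
(g_1,\, 0.8) & \mbox{if } x_1 \le 0.5, \\
(g_2,\, 2.3) & \mbox{if } x_1 > 0.5,
\end{array}
\right.
\]
so that $g_{\tau}(x) \in \{g_1, g_2\}$ and any two inputs that fall in the same leaf share the same threshold, as required of members of $\mathcal{T}$. The decision rule $x_1 \le 0.5$ is the interpretable description of group $g_1$.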

\subsection{Generalization of Worst Group Mis-coverage}

We want to learn a partition function $\tau(\cdot) \in \mathcal{T}$ that approximates the conditional quantile $F^{-1}_{S|X}(1-\alpha)$\footnote{$F^{-1}_{S|X}(1-\alpha) = \inf\{\hat{s} \in \mathrm{supp}(P_{S|X}): P(S \le \hat{s}|X) \ge 1-\alpha\}$.}. In practice, we have access to a finite dataset $\mathcal{D}$, on which the model family $\mathcal{T}$ may be prone to overfitting. Therefore, we want to choose a regularization parameter for $\mathcal{T}$ that ensures that the generalization properties of the final model are acceptable. In particular, we want to learn a partition where the worst group conditional coverage for the identified groups is as close as possible to $1-\alpha$. To do so, we first introduce our definitions of group conditional mis-coverage (Definition~\ref{def:conditional_miscoverage}) and worst group mis-coverage ratio (Definition~\ref{def:miscoverage_ratio}), and then our proposed objective.
\begin{definition}
\label{def:conditional_miscoverage}
    Consider a distance function $d:\mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}_{\ge 0}$, $\mathcal{G}$ a set of groups with membership function $g:\mathcal{X}\rightarrow\mathcal{G}$, a threshold function $q:\mathcal{X} \rightarrow \mathbb{R}$, and a target coverage $1-\alpha$. The group conditional mis-coverage of the threshold function $q$ over variable $S$ for a group $g_j \in \mathcal{G}$ based on distance $d$ is
\begin{equation}
\small
\begin{array}{l}
        MC_{\alpha}(q,g;g_j) =  \mathbb{E}_{X,S}[d(1-\alpha, P(S \le q(X)))|g(X) = g_j] 
\end{array}
\label{eq:conditional_miscoverage}
\end{equation} 
\end{definition}
Following Definition~\ref{def:conditional_miscoverage}, we are interested in measuring the worst group conditional mis-coverage relative to the marginal baseline, that is, the model that outputs a single quantile estimate for the entire input space. This indicates whether the proposed grouping, and the corresponding quantile estimates, provide a significant improvement in terms of worst-group coverage over a simple, marginal approach. Definition~\ref{def:miscoverage_ratio} presents the proposed worst group mis-coverage ratio.

\begin{definition}
\label{def:miscoverage_ratio}
    Consider a distance function $d:\mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}_{\ge 0}$, $\mathcal{G}_{\tau}$ the set of groups identified by $\tau(\cdot)$, $g_{\tau}(\cdot)$ the corresponding membership function, $q_{\tau}(\cdot)$ the corresponding quantile estimator, and $\hat{q} \simeq F^{-1}_{S}(1-\alpha)$ an empirical estimate of the marginal $1-\alpha$ quantile of $S$. Then, we define the worst group mis-coverage ratio as
\begin{equation}
\begin{array}{l}
        \textsc{mcr}_{\alpha}(\tau) =  \frac{\max\limits_{g_j \in \mathcal{G}_{\tau}} MC_{\alpha}(q_{\tau},g_{\tau};g_j) }{\max\limits_{g_j \in \mathcal{G}_{\tau}}  MC_{\alpha}(\hat{q},g_{\tau};g_j) }
\end{array}
\label{eq:miscoverage_ratio}
\end{equation}
\end{definition}

For the distance function we consider $d(1-\alpha,p) = |1-\alpha - p|$ or $d(1-\alpha,p) = (1-\alpha - p)_{+}$, where the latter only considers under-coverage violations.
The $\textsc{mcr}_{\alpha}(\tau)$ is less than 1 if the worst group mis-coverage on the proposed partition $\mathcal{G}_{\tau}$ is lower (better) than the worst group mis-coverage of a single quantile estimate on the same partition. In that case, we may prefer the proposed partition over the baseline model.
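
To make the ratio concrete, consider a hypothetical two-group setting (an illustrative assumption) with $\alpha = 0.1$, two equally likely groups, and exponentially distributed scores with rate $1$ in group $g_1$ and rate $2$ in group $g_2$. We also assume that each threshold is constant within a group, so that the group conditional mis-coverage reduces to $d(1-\alpha, p_j)$, with $p_j$ the coverage of group $g_j$. The marginal threshold $\hat{q}$ solves $\frac{1}{2}e^{-\hat{q}} + \frac{1}{2}e^{-2\hat{q}} = 0.1$, and we suppose that the partition model returns the imperfect thresholds $2.1$ and $1.2$:
\[
\begin{array}{lll}
\hat{q} \approx 1.767: & p_1 = 1 - e^{-1.767} \approx 0.8292, & p_2 = 1 - e^{-2 \cdot 1.767} \approx 0.9708, \\
q_{\tau} = (2.1,\, 1.2): & p_1 = 1 - e^{-2.1} \approx 0.8775, & p_2 = 1 - e^{-2 \cdot 1.2} \approx 0.9093.
\end{array}
\]
With $d(1-\alpha,p) = |1-\alpha-p|$ this gives
\[
\textsc{mcr}_{\alpha}(\tau) \approx \frac{\max(0.0225,\, 0.0093)}{\max(0.0708,\, 0.0708)} \approx 0.32,
\]
and the same value is obtained with $d(1-\alpha,p) = (1-\alpha-p)_{+}$, since under both models the worst group is the under-covered group $g_1$.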

Given two different group partitions, $\textsc{mcr}$ allows us to determine which of the two identifies a set of groups that benefits the most (in the worst group sense) from the new model relative to a marginal quantile estimate. Note that we cannot directly compare the worst group mis-coverage (MC) of two models, since the MCs are computed across different group definitions. The $\textsc{mcr}$ uses the marginal baseline as an intermediary model, which makes the two models comparable. The $\textsc{mcr}$ serves a similar role to the $R^2$ coefficient of determination (which compares the residual variance of a model against a constant baseline), but it is defined in terms of a pessimistic, worst-case scenario. It also serves as a computationally efficient alternative to a full auditing approach, in which an auditor uses a sophisticated optimization procedure to identify the worst computationally identifiable group in terms of mis-coverage.
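
One way to read this analogy, stated here only as an informal sketch, is to place the two quantities side by side. For a regression model with predictions $\hat{y}_i$ and constant baseline $\bar{y}$,
\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
1 - \textsc{mcr}_{\alpha}(\tau) = 1 - \frac{\max_{g_j \in \mathcal{G}_{\tau}} MC_{\alpha}(q_{\tau},g_{\tau};g_j)}{\max_{g_j \in \mathcal{G}_{\tau}} MC_{\alpha}(\hat{q},g_{\tau};g_j)}.
\]
In both cases the numerator measures the error of the model and the denominator measures the error of a constant baseline, so values of $R^2$ or $1-\textsc{mcr}_{\alpha}(\tau)$ close to one indicate a large improvement over that baseline. The difference is that $R^2$ aggregates errors by summation, whereas $\textsc{mcr}$ aggregates them by a maximum over groups.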

In practice, we observe that the proposed $\textsc{mcr}$ is a better criterion for model selection and group identification than the average pinball loss or the worst group mis-coverage alone on a held-out dataset. Selecting a model based only on the average pinball loss on a held-out dataset tends to favor models with smaller group sizes whose quantile estimates later fail to generalize, with worst-group coverages that fall behind even the marginal quantile estimate. On the other hand, choosing based only on worst group mis-coverage (i.e., worst group MC instead of $\textsc{mcr}$) tends to discard groups of low probability even in the large sample regime. Both effects are analyzed in Section~\ref{sec:experiments}.

\subsection{Group Discovery Objective}

We want to learn a generalizable partition function $\tau(\cdot) \in \mathcal{T}$ that provides the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$. Additionally, we want to ensure that the worst group mis-coverage across the learned partition improves over the one achieved with a baseline model over the same partition. To do this, we consider a regularization function $\mathcal{R}_{\theta}(\tau)$ with parameters $\theta \in \Theta$ that controls the complexity of the model $\tau(\cdot)$. The strength of the regularization is chosen based on the empirical $\textsc{mcr}$ score over a finite dataset $\mathcal{D}^a$, which yields the following objective:
\begin{equation}
\label{eq:gen_objective}
    \begin{array}{l}
        \theta^* \in \arg\min\limits_{\theta \in \Theta} \textsc{mcr}_{\alpha}(\tau_{\theta};\mathcal{D}^{a})
        \\
        \mathrm{s.t.} \quad
        \tau_{\theta} \in \arg\min\limits_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(q_{\tau}(X),S) \big] + \mathcal{R}_{\theta}(\tau).
    \end{array}
\end{equation}
The final partition function $\tau^*$ is the one that minimizes the empirical expected pinball loss with regularization $\mathcal{R}_{\theta^*}$,
\begin{equation}
\label{eq:final_model_objective}
    \begin{array}{l}
        \tau^* \in \arg\min\limits_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(q_{\tau}(X),S) \big] + \mathcal{R}_{\theta^*}(\tau).
    \end{array}
\end{equation}
Note that the average pinball loss is estimated over a dataset $\mathcal{D}^b$ that is independent from $\mathcal{D}^a$ but sampled from the same distribution. The objective we propose in Eq.~\ref{eq:gen_objective} essentially chooses the best model in terms of $\textsc{mcr}$ score among the set of regularized, pinball-loss-minimizing models.
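
For reference, we recall the standard form of the pinball (quantile) loss, which we assume is the loss $\ell_{1-\alpha}$ used above, for a threshold $q$ and a score $s$:
\[
\ell_{1-\alpha}(q,s) = (1-\alpha)\,(s-q)_{+} + \alpha\,(q-s)_{+}.
\]
For example, with $\alpha = 0.1$ and $q = 2$, a score $s = 3$ above the threshold costs $0.9 \cdot 1 = 0.9$, whereas a score $s = 1$ below it costs only $0.1 \cdot 1 = 0.1$. This asymmetry is what pushes the minimizer towards the $1-\alpha$ quantile.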

We stress that this objective is meaningful as a finite sample generalization constraint, since, given access to a sufficiently large sample set to learn the group-conditional quantiles, the $\textsc{mcr}$ would approach zero. In essence, given sufficient samples, every group in any partition of the input space would also contain sufficient samples for its estimated quantile to achieve near-exact group conditional coverage. An algorithm to solve Eq.~\eqref{eq:gen_objective} and a formalization of the above statement are provided in the following section.
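
A sketch of why the ratio vanishes, under the additional assumptions (made here for illustration) that the score distribution is continuous within each group, that the group-wise quantile estimates are consistent, and that thresholds are constant within groups so that the mis-coverage of a group is $d$ evaluated at its coverage: as the number of samples grows, the coverage of every group under $q_{\tau}$ tends to $1-\alpha$, so
\[
\max_{g_j \in \mathcal{G}_{\tau}} MC_{\alpha}(q_{\tau},g_{\tau};g_j) \;\longrightarrow\; d(1-\alpha,\,1-\alpha) = 0,
\]
while the denominator tends to $\max_{g_j \in \mathcal{G}_{\tau}} d\big(1-\alpha,\, P(S \le F^{-1}_{S}(1-\alpha) \mid g_{\tau}(X) = g_j)\big)$, which is strictly positive whenever the marginal threshold incurs a non-zero mis-coverage on at least one group. The ratio therefore tends to zero for any such partition, regardless of its granularity.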

\section{Discovering and Conformalizing  Groups in Practice}
\label{sec:region_conformal}

We consider $\theta$ to be a regularization parameter such that model complexity decreases monotonically as $\theta$ increases. In this setting, we propose Algorithm~\ref{alg:meta_algo} to find the regularization strength $\theta^*$ that recovers the pinball loss minimizer with the lowest $\textsc{mcr}$ from a family of clustering methods $\mathcal{T}$. Following this discovery step, we run a group-conditional conformal prediction mechanism on the discovered regions to conformalize the score quantiles and produce conformal sets/intervals with local coverage guarantees.

{
Proposition~\ref{lemma:asymptotic_solution} shows that Algorithm~\ref{alg:meta_algo} is optimal in the infinite sample regime, where the generalization objective is easily achieved by any partition. That is, in the absence of generalization issues, Algorithm~\ref{alg:meta_algo} recovers the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$ within the desired model class, and, when generalization is a challenge, it finds the best pinball loss minimizer.
Although this particular result hinges on the infinite sample assumption, we stress that Algorithm~\ref{alg:meta_algo} also performs group-conditional conformal prediction on each of the recovered groups (the last step in Algorithm~\ref{alg:meta_algo}), which does have finite sample group conditional guarantees, as shown in Eq.~\ref{eq:group_guarantees}.
\begin{proposition}
    \label{lemma:asymptotic_solution}
    Given the objective in Eq.~\ref{eq:gen_objective}, if $\mathcal{D}^a = P(X,S)$ (infinite sample regime) and $\theta_0$ in Algorithm~\ref{alg:meta_algo} is the weakest admissible regularization, then $\tau^*=\tau_{\theta_0}$, which also minimizes the pinball loss over all admissible regularizations:
     $\mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau^*}(X),S) \big] \le \mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau_{\theta}}(X),S) \big]$ for all $\theta \in \Theta$ such that $\theta \ge \theta_0$.
\end{proposition}

}



\paragraph{Learning Generalizable Quantile Score Regions.}
Algorithm~\ref{alg:meta_algo} assumes access to a solver for the $\tau_{\theta}$ objective, denoted $M_{1-\alpha}$, and a conformal prediction mechanism $\mathcal{A}_{CP}$, in addition to a dataset $\mathcal{D}_1$ containing input samples and their corresponding non-conformity scores. The initial parameter $\theta_0$ is the weakest regularization that is acceptable for interpretability purposes (e.g., a maximum tree depth), $\prod_{\Theta}(\cdot)$ is a projection operator onto the regularization parameter space, and $\Delta_{\theta} > 0$ is a step size that guarantees a change in $\theta_{t}$ after projection onto $\Theta$ unless the minimum admissible complexity bound has been reached. The final clustering model $\tau^*(\cdot)=(g_{\tau^*}, q_{\tau^*})(\cdot)$ is learned using the best regularization parameter $\theta^* \in \Theta$ in terms of $\textsc{mcr}$. This simple approach of steadily increasing the regularization strength in finite increments $\Delta_{\theta}$ and stopping when $\textsc{mcr}$ fails to improve is sufficient for our purposes, but more sophisticated zeroth-order approaches could replace this update strategy.
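
As a concrete, purely hypothetical instance of the update, let $\theta$ be the minimum number of samples per leaf of a tree, with $\Theta = [\theta_0, \theta_{\max}]$ and the projection taken to be clipping to this interval. With the illustrative values $\theta_0 = 50$, $\Delta_{\theta} = 50$ and $\theta_{\max} = 200$, the iterates are
\[
\theta_{t+1} = \min(\theta_{t} + \Delta_{\theta},\, \theta_{\max}):
\qquad
\theta_0 = 50,\; \theta_1 = 100,\; \theta_2 = 150,\; \theta_3 = 200,\; \theta_4 = 200,
\]
so the regularization strengthens (fewer, larger leaves) until the minimum admissible complexity bound is reached, after which the projected parameter no longer changes.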

\paragraph{Conformalizing the Conditional Quantiles of the Discovered Regions.}
The learned clustering function $g_{\tau^*}(\cdot)$ is then fed into a group conditional conformal prediction mechanism $\mathcal{A}_{CP}$, such as those of \cite{vovk2012conditional,foygel2021limits,jung2022batch,gibbs2023conformal}, to provide conformalized thresholds for each identified group.

For example, for clustering functions $g_{\tau^*}(\cdot)$ that partition the space with no overlaps, we consider a standard group conditional split conformal method where $\mathcal{A}_{CP}$ provides the conformal quantile estimator $q_{\tau_{cp}}(\cdot)$ based on the conformal quantile of each identified group. The corresponding conformal set $C_{\tau}(X_{n+1})$ for a new sample is defined as
\begin{equation}
             C_{\tau}(X_{n+1}) = \{y \in \mathcal{Y}:     S_f(X_{n+1},y) \le q_{\tau_{cp}}(X_{n+1})\}
\end{equation}

where the conformal quantile function $q_{\tau_{cp}}(\cdot)$ is
\begin{equation}
\small
q_{\tau_{cp}}(X_{n+1}) = \mathbb{Q}_{1-\alpha}\Big(\sum^n_{i=1} \frac{\mathbf{1}[g_{n+1}=g_i]}{n_{g_i}+1}\delta_{s_i} + \frac{1}{n_{g_{n+1}}+1} \delta_{\infty}\Big)
\end{equation}
with $g_i = g_{\tau^*}(x_i)$ for all $i \in [n+1]$, and $n_{g_i}$\footnote{$n_{g_i} = | \{j \in \mathcal{D}_2: g_{\tau^*}(x_j)=g_i\} |$.} the number of samples of group $g_i$ in dataset $\mathcal{D}_2$.
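
As a worked example with illustrative numbers, let $\alpha = 0.1$ and suppose the group of $X_{n+1}$ contains $n_{g_{n+1}} = 19$ calibration samples with distinct scores. Each of these scores receives mass $1/20$, as does the point mass at $\infty$, so the $1-\alpha$ quantile is the $k$-th smallest score in the group with
\[
k = \lceil (1-\alpha)(n_{g_{n+1}}+1) \rceil = \lceil 0.9 \cdot 20 \rceil = 18.
\]
If instead the group contains only $n_{g_{n+1}} = 8$ samples, then $k = \lceil 0.9 \cdot 9 \rceil = 9 > 8$, the quantile falls on the point mass at $\infty$, and the conformal set is all of $\mathcal{Y}$. Groups with fewer than $\lceil (1-\alpha)/\alpha \rceil$ samples (here $9$) therefore yield uninformative sets.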

Moreover, for each identified group $g \in \mathcal{G}_{\tau^*}$ the coverage guarantees become 
\begin{equation}
\small
\label{eq:group_guarantees}
\begin{array}{l}
     1- \alpha \le P\Big(Y_{n+1} \in C_{\tau}(X_{n+1})| g_{\tau^*}(X_{n+1}) = g \Big)  \\
     \hspace{0.5in} \le 1- \alpha + \frac{1}{n_{g}+1}.
\end{array}
\end{equation}
Note that the upper bound depends on the number of samples $n_g$ of group $g$ in the calibration set.
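
For instance, with the illustrative value $\alpha = 0.1$, the bounds in Eq.~\ref{eq:group_guarantees} evaluate to
\[
\begin{array}{ll}
n_g = 99: & 0.90 \le P\big(Y_{n+1} \in C_{\tau}(X_{n+1}) \mid g_{\tau^*}(X_{n+1}) = g\big) \le 0.90 + \frac{1}{100} = 0.91, \\
n_g = 9: & 0.90 \le P\big(Y_{n+1} \in C_{\tau}(X_{n+1}) \mid g_{\tau^*}(X_{n+1}) = g\big) \le 0.90 + \frac{1}{10} = 1.00.
\end{array}
\]
Small groups thus retain validity but may be noticeably over-covered.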


\begin{algorithm}[h!]
\caption{Region Identification Meta-Algorithm}
\footnotesize
\label{alg:meta_algo}
\begin{algorithmic}
\REQUIRE i.i.d. dataset $\mathcal{D}_1$ of input samples and corresponding non-conformity scores; solver $M_{1-\alpha}(\cdot,\cdot):\mathcal{D} \times \Theta \rightarrow \mathcal{T}$ for $\tau_{\theta}$ in Eq.~\ref{eq:gen_objective}; group-conditional conformal prediction mechanism $\mathcal{A}_{CP}:\mathcal{D} \times \mathcal{G}^{|\mathcal{D}|} \rightarrow \mathbb{R}^{|\mathcal{G}|}$; weakest acceptable regularization parameter $\theta_0 \in \Theta$; regularization step size $\Delta_{\theta}$.
\STATE // Region Identification
\STATE $\textsc{MCR}^* \leftarrow \infty$ // Initialize best $\textsc{mcr}$
\FOR{$t=0, \dots, T$}
\STATE $\textsc{MCR}_t \leftarrow \{\}$ // Initialize $\textsc{mcr}$ set for step $t$
\STATE // K-fold cross-validation
\FOR{$k=1,\dots,K$}
\STATE Split $\mathcal{D}_1$ randomly into $\mathcal{D}^{a,k}$ and $\mathcal{D}^{b,k}$
\STATE $\tau_{\theta_t} \leftarrow M_{1-\alpha}(\mathcal{D}^{b,k},\theta_{t})$
\STATE $\textsc{MCR}_t \leftarrow \textsc{MCR}_t \cup \{\textsc{mcr}_{\alpha}(\tau_{\theta_t};\mathcal{D}^{a,k})\}$
\ENDFOR
\STATE $\textsc{sMCR} \leftarrow \mathrm{mean}(\textsc{MCR}_t) + \mathrm{std}(\textsc{MCR}_t)$
\IF{$\textsc{sMCR} < \textsc{MCR}^*$}
\STATE $\textsc{MCR}^* \leftarrow \textsc{sMCR}$,  $\theta^* \leftarrow \theta_t $
\ENDIF
\STATE $\theta_{t+1} \leftarrow \prod_{\Theta}(\theta_{t} + \Delta_{\theta})$
\ENDFOR
\STATE $\tau^* \leftarrow  M_{1-\alpha}(\mathcal{D}_1,\theta^*)$
\STATE // Conformalize group conditional quantile predictor
\STATE $q_{\tau_{cp}} \leftarrow \mathcal{A}_{CP}(\mathcal{D}_{1},\{g_{\tau^*}(x_i)\}_{i \in \mathcal{D}_{1}})$
\STATE \OUTPUT $\tau_{cp}=(g_{\tau^*},q_{\tau_{cp}})$
\end{algorithmic}
\end{algorithm}
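
To illustrate the selection rule in Algorithm~\ref{alg:meta_algo}, suppose (hypothetical values) that $K = 3$ folds yield $\textsc{MCR}_t = \{0.4, 0.5, 0.6\}$ for one regularization value and $\textsc{MCR}_{t'} = \{0.1, 0.5, 0.9\}$ for another. The algorithm does not fix the normalization of the standard deviation; using the sample standard deviation (an assumption made here), we obtain
\[
\begin{array}{l}
\textsc{sMCR}_t = 0.5 + \sqrt{\frac{0.1^2 + 0^2 + 0.1^2}{2}} = 0.5 + 0.1 = 0.6, \\
\textsc{sMCR}_{t'} = 0.5 + \sqrt{\frac{0.4^2 + 0^2 + 0.4^2}{2}} = 0.5 + 0.4 = 0.9.
\end{array}
\]
Both candidates have the same mean $\textsc{mcr}$, but the first is preferred because its score is more stable across folds.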

\subsection{Learning Decision-Tree-Based Regions}
\label{subsec:Learning DTrees}

Decision trees are a natural candidate for learning partition functions, since they are inherently interpretable, especially at low tree depths. We need access to a solver $M_{1-\alpha}$ that, given a dataset and a regularization parameter, provides a tree that minimizes the $1-\alpha$ average pinball loss as in Eq.~\ref{eq:gen_objective}. The challenge with existing decision tree regression optimizers is that, as far as we know, available solvers do not support the pinball loss. Therefore, we first train a surrogate model $h^* \in \mathcal{H}$ from a model family for which pinball loss solvers are available. Then, we approximate the output of $h^*$ with a decision tree by minimizing the mean squared error against the surrogate model's predicted (input-dependent) quantile. This procedure is summarized in the following objective
\begin{equation}
    \begin{array}{c}
          \tau_{\theta} \in \arg\min_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[(q_{\tau}(X)-h^*(X))^2 \big] + \mathcal{R}_{\theta}(\tau), \\
          \mathrm{s.t.} \quad h^* \in  \arg\min_{h \in \mathcal{H}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(h(X),S) \big].
    \end{array}
    \label{eq:quantile_regression_tree_objective}
\end{equation}
% Here $h^*$ is the surrogate model and $\mathcal{H}$ a family of models for which we have access to a solver that optimizes the pinball loss (as stated in the second equation). 
In our experiments, we take $\mathcal{H}$ to be a family of gradient-boosted decision trees that support the pinball loss \cite{ke2017lightgbm}, and use hyperparameter optimization \cite{akiba2019optuna} to minimize overfitting in the surrogate model $h^*$.
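
For a fixed partition and ignoring the regularization term, the first problem in Eq.~\ref{eq:quantile_regression_tree_objective} has a closed-form solution on each leaf, which follows from the standard fact that the mean minimizes the squared error. Writing $\mathcal{D}^b_g = \{i \in \mathcal{D}^b : g_{\tau}(x_i) = g\}$ (notation introduced here for illustration),
\[
q_{\tau}(x) = \frac{1}{|\mathcal{D}^b_g|} \sum_{i \in \mathcal{D}^b_g} h^*(x_i) \quad \mbox{for all } x \mbox{ with } g_{\tau}(x) = g.
\]
For example, if a leaf contains three samples with surrogate predictions $h^*(x_i) \in \{1.8,\, 2.0,\, 2.5\}$ (made-up values), its threshold is $q_{\tau} = (1.8 + 2.0 + 2.5)/3 = 2.1$. The leaf value is therefore an average of predicted quantiles rather than an empirical quantile of the scores themselves, and the finite-sample guarantee of Eq.~\ref{eq:group_guarantees} comes from the subsequent conformalization step rather than from these leaf values.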
