
\section{Region Identification Based on Non-Conformity Score Quantiles}
\label{sec:region_identification}
Given a non-conformity score, we want to discover regions in the input space that \mbcomment{maximize intra-group homogeneity of the score distribution, but still differ significantly between groups}. These regions, if interpretable, provide useful insights about the uncertainty of a model's predictions. Moreover, they can be leveraged at different stages of the ML life cycle, such as data filtering and collection.

% Given a miss-coverage objective $\alpha$ we want to learn a mapping $\tau: \mathcal{X} \rightarrow \mathcal{G} \times \mathbb{R}$,\footnote{we can also consider soft-clustering such that $\tau: \mathcal{X} \rightarrow \Delta^{|\mathcal{G}|-1} \times \mathbb{R}$}, that outputs a computationally-identifiable set of groups of minimum size $\delta$ and an estimate of the $1-\alpha$ conformity score quantile for each group, $\tau(X) = (g_{\tau}(X),q_{\tau}(X))$. We use $g_{\tau}(X)$ to denote the group label and $q_{\tau}(X)$ to denote the corresponding quantile estimate (i.e., score threshold). We consider $\tau$ to belong to a family of piece-wise constant models $\mathcal{T}$ such that $\forall \tau \in \mathcal{T}, \forall x_1,x_2 \in \mathcal{X} : g_{\tau}(x_1) = g_{\tau}(x_2) \rightarrow q_{\tau}(x_1) = q_{\tau}(x_2)$. 

Given a target mis-coverage level $\alpha$, we want to learn a mapping $\tau: \mathcal{X} \rightarrow \mathcal{G} \times \mathbb{R}$\footnote{We can also consider soft clustering, in which case $\tau: \mathcal{X} \rightarrow \Delta^{|\mathcal{G}|-1} \times \mathbb{R}$.} that outputs a computationally identifiable set of groups and an estimate of the $1-\alpha$ quantile of the non-conformity score for each group, $\tau(X) = (g_{\tau}(X),q_{\tau}(X))$. We use $g_{\tau}(X)$ to denote the group label and $q_{\tau}(X)$ to denote the corresponding quantile estimate (i.e., score threshold). We consider $\tau$ to belong to a family of piece-wise constant models $\mathcal{T}$ such that $\forall \tau \in \mathcal{T}, \forall x_1,x_2 \in \mathcal{X} : g_{\tau}(x_1) = g_{\tau}(x_2) \Rightarrow q_{\tau}(x_1) = q_{\tau}(x_2)$.

Piece-wise constant models provide an interpretable characterization of the identified groups based on the input features; this is especially true for models such as trees, where the decision rules used to identify each group (leaf node) are clearly laid out. Note that our approach can also be applied to an interpretable feature space by choosing $\tau(\phi(X))$, where $\phi(\cdot)$ is a mapping of the input into that space. In particular, $\phi(X) = (X,f(X))$ makes the partitioning depend directly on the output of $f$. This allows the implicit identification of different uncertainty regions based on the model's prediction.

% The reason to consider piece-wise constant models is that, under some conditions, they provide interpretable characteristics about the identified groups based on input feature rules. { The reason to restrict group size to a minimum value $\delta$ is to avoid over-partitioning the input space and to improve the latter generalization properties of the model.} Note that our approach could also be applied to some interpretable feature space of the input by choosing $\tau(\phi(X))$ where $\phi(\cdot)$ is some mapping into an interpretable feature space. In particular, $\phi(X) = (X,f(X))$ makes the partitioning depend directly on the output of $f$. This allows the implicit identification of different uncertainty regions based on the model prediction.


% We are interested in piece-wise constant models since, under some conditions, they provide interpretable characteristics about the identified groups based on input features rules. Note that we consider $X$ as the input of the clustering model but we could have considered $\tau(\phi(X))$ where $\phi(\cdot)$ is some mapping into an interpretable feature space.

\subsection{Generalization of Worst Group Mis-coverage}

We want to learn a partition function $\tau(\cdot) \in \mathcal{T}$ that provides a good approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$.\footnote{$F^{-1}_{S|X}(1-\alpha) = \inf\{\hat{s} \in \mathrm{supp}(P_{S|X}): P(S \le \hat{s}|X) \ge 1-\alpha\}$.}
In practice, we have access to a finite dataset $\mathcal{D}$, on which the model family $\mathcal{T}$ may be prone to overfitting. Therefore, we want to choose a regularization parameter for $\mathcal{T}$ that ensures that the generalization properties of the final model are acceptable. In particular, we want to learn a partition where the worst group conditional coverage for the identified groups is as close as possible to $1-\alpha$. To do so, we first introduce our definitions of group conditional mis-coverage (Definition \ref{def:conditional_miscoverage}), worst group mis-coverage ratio (Definition \ref{def:miscoverage_ratio}), and then our proposed objective.
\begin{definition}
\label{def:conditional_miscoverage}
    Consider a distance function $d:\mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}_{\ge 0}$, a set of groups $\mathcal{G}$ with membership function $g:\mathcal{X}\rightarrow\mathcal{G}$, a threshold function $q:\mathcal{X} \rightarrow \mathbb{R}$, and a target coverage $1-\alpha$. The group conditional mis-coverage of $q$ over the variable $S$ for a group $g_j \in \mathcal{G}$, based on distance $d$, is defined as
\begin{equation}
\small
\begin{array}{l}
        MC_{\alpha}(q,g;g_j) =  \mathbb{E}_{X,S}[d(1-\alpha, P(S \le q(X)))|g(X) = g_j] 
\end{array}
\label{eq:conditional_miscoverage}
\end{equation} 
\end{definition}
% abd given set of groups we have 
% and with a universal class of functions $\mathcal{T}$, $\tau$  where the  we expect to have generalization gaps between the coverage achieved in the identified groups when considering out of sample data. However, 
%%% Maybe here add the objective w.r.t. 
% Given a partition function $\tau(\cdot) \in \mathcal{T}$, and the 
% following Definition \ref{def:conditional_miscoverage}, we are interested in measuring the worst group conditional miscoverage  w.r.t. the marginal baseline, that is, the model that outputs a single quantile estimate for the entire input space. This indicates if the proposed grouping, and corresponding quantile estimates, provides a significant improvement in terms of worst-group coverage over a simple, marginal approach. Definition \ref{def:miscoverage_ratio} presents the proposed worst group miscoverage ratio. 
Following Definition \ref{def:conditional_miscoverage}, we are interested in measuring the worst group conditional mis-coverage w.r.t. the marginal baseline, that is, the model that outputs a single quantile estimate for the entire input space. This indicates whether the proposed grouping and the corresponding quantile estimates provide a significant improvement in terms of worst-group coverage over a simple, marginal approach. Definition \ref{def:miscoverage_ratio} presents the proposed worst group mis-coverage ratio.

\begin{definition}
\label{def:miscoverage_ratio}
    Consider a distance function $d:\mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}_{\ge 0}$, the set of groups $\mathcal{G}_{\tau}$ identified by $\tau(\cdot)$ with membership function $g_{\tau}(\cdot)$, the corresponding quantile estimator $q_{\tau}(\cdot)$, and $\hat{q} \simeq F^{-1}_{S}(1-\alpha)$, an empirical estimate of the marginal $1-\alpha$ quantile of $S$. Then, we define the worst group mis-coverage ratio as
\begin{equation}
\begin{array}{l}
        MCR_{\alpha}(\tau) =  \frac{\max\limits_{g_j \in \mathcal{G}_{\tau}} MC_{\alpha}(q_{\tau},g_{\tau};g_j) }{\max\limits_{g_j \in \mathcal{G}_{\tau}}  MC_{\alpha}(\hat{q},g_{\tau};g_j) }
\end{array}
\label{eq:miscoverage_ratio}
\end{equation}
\end{definition}
% {\color{blue} Rewrite MCR as ratios of MC} {\color{red} I do not know how to}
% where we denote the non-conformity random variable $S=S_f(X,Y)$, and take the expectation over $X$ and $S$ to simplify notation. In the finite sample regime we use $wMCR (\tau;\mathcal{D})$ where $\mathcal{D}=\{(x_i,y_i,s_i)\}^{n}_{i=1}$ with $s_i=S_{f}(x_i,y_i)$ represents a dataset over which the expectations are approximated. 

% Note that $MCR(\tau)$ is less than 1 if the worst group miscoverage on the proposed partition $\mathcal{G}_{\tau}$ is lower (better) than the worst miscoverage of a single quantile estimate. In such case we may prefer the proposed partition over the baseline model. Given two different group partitions, the MCR allows us to compare which of the two identified a set of groups that would benefit the most in the worst case sense over being served by a single quantile estimate model. We think of MCR as a ratio similar to the coefficient of determination $r2$ that would be controlling for overfitting but for a pessimistic worst case scenario. Moreover, it can be interpreted as cost function to evaluate different groups partition models that is computationally cheaper than a cost that relies on some optimization such as the worst computationally identifiable mis-coverage group.

For the distance function we consider $d(1-\alpha,p) = |1-\alpha - p|$ or $d(1-\alpha,p) = (1-\alpha - p)_{+}$, where the latter only considers under-coverage violations.
The ratio $MCR_{\alpha}(\tau)$ is less than 1 if the worst group mis-coverage on the proposed partition $\mathcal{G}_{\tau}$ is lower (better) than the worst group mis-coverage obtained with a single marginal quantile estimate on the same partition. In that case, we may prefer the proposed partition over the baseline model.
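
As a purely illustrative example (all numbers are hypothetical), take $\alpha = 0.1$, $d(1-\alpha,p) = |1-\alpha-p|$, and a partition with two groups $g_1, g_2$. We assume here that each group mis-coverage is evaluated as the distance between the target coverage and the coverage rate observed within the group. If the group-wise thresholds $q_{\tau}$ cover $88\%$ and $91\%$ of the scores in $g_1$ and $g_2$, while the marginal threshold $\hat{q}$ covers $80\%$ and $95\%$, then
\[
\begin{array}{l}
    \max\limits_{j} MC_{\alpha}(q_{\tau},g_{\tau};g_j) = \max\{|0.90-0.88|,\,|0.90-0.91|\} = 0.02, \\
    \max\limits_{j} MC_{\alpha}(\hat{q},g_{\tau};g_j) = \max\{|0.90-0.80|,\,|0.90-0.95|\} = 0.10, \\
    MCR_{\alpha}(\tau) = 0.02/0.10 = 0.2 .
\end{array}
\]
The partition therefore reduces the worst group mis-coverage to one fifth of that of the marginal baseline. With $d(1-\alpha,p) = (1-\alpha-p)_{+}$ the over-covered groups contribute zero, and the ratio in this example is again $0.02/0.10 = 0.2$.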

Given two different group partitions, the MCR allows us to determine which of the two identifies a set of groups that benefits more (in the worst-group sense) from the new model than from a marginal quantile estimate. Note that we cannot directly compare the worst group mis-coverage (MC) of two different models, since the MCs are computed across different group definitions. The MCR uses the marginal baseline as an intermediary model and thereby makes the two models comparable. The MCR serves a similar role to the $R^2$ coefficient of determination (which compares the residual variance of a model against a constant baseline), but it is defined in terms of a pessimistic, worst-case scenario. It also serves as a computationally efficient alternative to a full auditing approach, in which an auditor uses a sophisticated optimization procedure to identify the worst computationally identifiable group in terms of mis-coverage.

% Given two different group partitions, the MCR allows us to compare which of the two identified a set of groups that would benefit the most in the worst case sense over being served by a single quantile estimate model. 
% { Note that we could not directly compare the worst group miscoverage (MC) between two different models directly, since the MCs are computed across different group definitions. The MCR uses the marginal baseline as an intermediary model, and allows us to compare these two models}. The MCR ratio serves a similar role to the $R2$ coefficient of determination {\color{blue} add a comment on where or for what its used?: which provides a measure for the proportion of variance in the target variable that is explained by a model over a bias predictor}, but is instead defined in a pessimistic, worst-case scenario. {\color{blue} [Need to revise]. MCR serves as a cheap alternative to a more thorough auditing approach that relies on an optimization procedure to identify the worst computationally identifiable mis-coverage group.}


% In practice, we observe that the proposed $MCR$ provides a better criteria for model selection and group identification than average pinball loss or simply worst group miscoverage on a held out dataset. As we show in the experimental section, choosing only based on average pinball loss on a held-out dataset tends to identify groups of small size whose quantile estimates do not generalize well, and have worse coverage than using the average quantile estimate. On the other hand, choosing only based on worst group miscoverage tend to discard groups of low probability even in the large sample regime. This is something we analyze further in the experimental section.

In practice, we observe that the proposed $MCR$ is a better criterion for model selection and group identification than the average pinball loss or the worst group mis-coverage alone on a held-out dataset. As we show in Section \ref{sec:experiments}, selecting a model based only on the average pinball loss on a held-out dataset tends to favor models with smaller groups whose quantile estimates later fail to generalize, with worst-group coverage that falls behind even that of the marginal quantile estimate. On the other hand, selecting based only on the worst group mis-coverage (i.e., worst group MC instead of MCR) tends to discard groups of low probability even in the large sample regime. This is analyzed further in Section \ref{sec:experiments}.

% We decide to evaluate 

\subsection{Group Discovery Objective}

% {\color{blue} UP TO HERE}

We want to learn a generalizable partition function $\tau(\cdot) \in \mathcal{T}$ that provides the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$. Additionally, we want to ensure that the worst group mis-coverage across the learned partition improves over the one achieved with a baseline model over the same partition. To do this, we consider a regularization function $\mathcal{R}_{\theta}(\tau)$ with parameters $\theta \in \Theta$ that controls the complexity of the model $\tau(\cdot)$; the strength of the regularization is chosen based on the empirical $MCR$ score over a finite dataset $\mathcal{D}^a$. The resulting objective is
\begin{equation}
\label{eq:gen_objective}
    \begin{array}{l}
        \theta^* \in \arg\min\limits_{\theta \in \Theta} MCR_{\alpha}(\tau_{\theta};\mathcal{D}^{a})
        \\
        \mathrm{s.t.}\quad
        \tau_{\theta} \in \arg\min\limits_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(q_{\tau}(X),S) \big] + \mathcal{R}_{\theta}(\tau).
    \end{array}
\end{equation}

The final partition function $\tau^*$ is the one that minimizes the empirical expected pinball loss with regularization $\mathcal{R}_{\theta^*}$,
\begin{equation}
\label{eq:final_model_objective}
    \begin{array}{l}
        \tau^* \in \arg\min\limits_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(q_{\tau}(X),S) \big] + \mathcal{R}_{\theta^*}(\tau).
    \end{array}
\end{equation}

Note that the average pinball loss is estimated over a dataset $\mathcal{D}^b$ that is independent from $\mathcal{D}^a$ but sampled from the same distribution. The objective we propose in Eq.~\ref{eq:gen_objective} essentially chooses the best model in terms of MCR score among the set of regularized, pinball-loss-minimizing models.
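
For reference, and assuming the standard definition of the pinball (quantile) loss, which may be stated elsewhere in the paper in an equivalent form, the loss at level $1-\alpha$ for a threshold $q$ and a score $s$ is
\[
    \ell_{1-\alpha}(q,s) = (1-\alpha)\,(s-q)_{+} + \alpha\,(q-s)_{+}.
\]
For instance, with $\alpha = 0.1$ and $q = 1$, a score $s = 2$ above the threshold incurs a loss of $0.9 \cdot 1 = 0.9$, whereas a score $s = 0$ below the threshold incurs $0.1 \cdot 1 = 0.1$. Under-estimating the quantile is thus penalized nine times more heavily than over-estimating it, which is what drives the minimizer toward the $1-\alpha$ quantile.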

% We stress that this objective is meaningful as a finite sample generalization constraint, since, given access to a sufficiently large sample set to learn the group-conditinal quantiles, the MCR would be zero. In essence, in the infinite sample regime, any quantile estimated for any partition of the input space would also have infinite samples and the estimated quantile would achieve exact group conditional coverage. This is formalized in Lemma \ref{lemma:asymptotic_solution} , where we also show that, if we start with the maximum model complexity parameter (or weakest regularization strength) that we are willing to consider, $\theta_0$, the first solution is the one that will provide the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$

We stress that this objective is meaningful as a finite-sample generalization constraint: given access to a sufficiently large sample to learn the group-conditional quantiles, the MCR would approach zero. In essence, given enough samples, every group in any partition of the input space would contain sufficient samples for its estimated quantile to achieve near-exact group conditional coverage. An algorithm to solve Eq.~\ref{eq:gen_objective} and a formalization of the above statement are provided in the following section.





% We remark that this objective is meaningful in the context of the finite sample regime, since if we assume that we had infinite samples to learn the group quantile estimators the MCR would be zero. In essence, any quantile estimated for any partition of the input space would have infinite samples and the estimated quantile would achieve exact group conditional coverage. This is formalized in Proposition {\color{red} PROP}, where we also show that if we start with the maximum model complexity parameter (or weakest regularization strength) that we are willing to consider, $\theta_0$, the first solution is the one that will provide the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$.


% \section{Region discovery optimization and conformal prediction integration}
\section{Discovering and Conformalizing  Groups in Practice}
\label{sec:region_conformal}

We consider $\theta$ to be a regularization parameter such that model complexity decreases monotonically as $\theta$ increases. In this setting, we propose Algorithm \ref{alg:meta_algo} to find the regularization strength $\theta^*$ that recovers the pinball loss minimizer with the lowest MCR from a family of clustering methods $\mathcal{T}$. Following this discovery step, we run a group-conditional conformal prediction mechanism on the discovered regions to conformalize the score quantiles and produce conformal sets/intervals with local coverage guarantees.

{
Proposition \ref{lemma:asymptotic_solution} shows that Algorithm \ref{alg:meta_algo} is optimal in the infinite sample regime, where the generalization objective is easily achieved by any partition. That is to say, even in the absence of generalization issues, Algorithm \ref{alg:meta_algo} correctly approximates the conditional quantile $F^{-1}_{S|X}(1-\alpha)$ within the desired model class (and finds the best pinball loss minimizer in the presence of generalization challenges otherwise).

\begin{proposition}
    \label{lemma:asymptotic_solution}
    Given the objective in Eq.~\ref{eq:gen_objective}, if $\mathcal{D}^a = P(X,S)$ (infinite sample regime) and $\theta_0$ in Algorithm \ref{alg:meta_algo} is the weakest admissible regularization, then $\tau^*=\tau_{\theta_0}$, which also minimizes the pinball loss over all admissible regularizations:
     $\mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau^*}(X),S) \big] \le \mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau_{\theta}}(X),S) \big], \forall \theta \in \Theta$ such that $ \theta \ge \theta_0$.
\end{proposition}

% This is formalized in Lemma \ref{lemma:asymptotic_solution}, where we also show that, by starting from the maximum model complexity (weakest regularization) that we are willing to consider, $\theta_0$, the first solution is the one that will provide the best approximation of the conditional quantile $F^{-1}_{S|X}(1-\alpha)$


% \begin{lemma}
%     \label{lemma:asymptotic_solution}
%     Given the objective in Eq.~\ref{eq:gen_objective} if $\mathcal{D}^a = P(X,S)$ (infinite sample regime) $\theta^* = \theta, \forall \theta \in \Theta$. Additionally, if $\theta_0$ in Algorithm \ref{alg:meta_algo} is the weakest regularization then $\tau^*=\tau_{\theta_0}$ which is a pinball loss minimizer over all acceptable regularizations
%      $\mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau^*}(X),S)) \big] \le \mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau_{\theta}}(X),S)) \big], \forall \theta \in \Theta$ such that $ \theta \ge \theta_0$.
% \end{lemma}
}



\paragraph{Learning Generalizable Quantile Score Regions.}
In Algorithm \ref{alg:meta_algo} we assume access to a solver for the $\tau_{\theta}$ objective, denoted $M_{1-\alpha}$, and a conformal prediction mechanism $\mathcal{A}_{CP}$, in addition to two datasets ($\mathcal{D}_1,\mathcal{D}_2$) containing input samples and their corresponding non-conformity scores. The initial parameter $\theta_0$ is the weakest regularization that is acceptable for interpretability purposes (e.g., the one corresponding to the maximum tree depth), $\prod_{\Theta}(\cdot)$ is a projection operator onto the regularization parameter space, and $\Delta_{\theta} > 0$ is a step size that guarantees a change in $\theta_{t}$ when projected onto $\Theta$ unless the minimum admissible complexity bound has been reached. The final clustering model $\tau^*(\cdot)=(g_{\tau^*}, q_{\tau^*})(\cdot)$ is learned using the best regularization parameter $\theta^* \in \Theta$ in terms of MCR. This simple approach of steadily increasing the regularization strength in finite increments $\Delta_{\theta}$ and stopping when MCR fails to improve is sufficient for our purposes, but more sophisticated zero-order approaches could replace this update strategy.
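
To illustrate the selection rule in Algorithm \ref{alg:meta_algo} with hypothetical numbers, suppose $K=2$ and that two candidate parameters $\theta_0$ and $\theta_1$ yield the fold-wise values $\textsc{MCR}_0 = \{0.5, 1.1\}$ and $\textsc{MCR}_1 = \{0.7, 0.9\}$. Writing $\textsc{sMCR}_t$ for the score computed at step $t$ (a subscript introduced only for this example) and assuming the population standard deviation is used (the algorithm does not fix this choice), we obtain
\[
\begin{array}{l}
    \textsc{sMCR}_0 = 0.8 + 0.3 = 1.1, \\
    \textsc{sMCR}_1 = 0.8 + 0.1 = 0.9 .
\end{array}
\]
Both candidates have the same mean MCR, but $\theta_1$ is selected because its score is more stable across folds.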

% We acknowledge that we are considering a naive approach of increasing the regularization by $\delta_{\theta}$ until there is not improvement in MCR but a more sophisticated zero-order approach could substitute this update strategy. 

\paragraph{Conformalizing the Conditional Quantiles of the Discovered Regions.}
The learned clustering function $g_{\tau^*}(\cdot)$ is then fed into a group conditional conformal prediction mechanism $\mathcal{A}_{CP}$, such as those in \cite{vovk2012conditional,foygel2021limits,jung2022batch,gibbs2023conformal}, to provide conformalized thresholds for each identified group.
% Then, a calibration dataset $\mathcal{D}_2$ and their corresponding group labels, based the learned clustering function $g_{\tau^*}(\cdot)$, are fed into a group conditional conformal prediction mechanism, $\mathcal{A}_{CP}$ such as {\color{red} CITE } to provide conformalized thresholds for each identified group. 
% For example, we consider a standard group conditional split conformal method where $\mathcal{A}_{CP}$ provides the conformal quantile estimator $q_{\tau^*_{CP}}(\cdot)$ based on the $ \frac{\lceil{(1-\alpha)(n_g+1)}\rceil}{n_g}$ empirical quantile for each identified group,\footnote{$n_g = | \{i: g_{\tau^*}(x_i)=g\}_{ i \in \mathcal{D}_2} |$ for each $g \in \mathcal{G}_{\tau^*}$}. Here the corresponding conformal set $C_{\tau}(X_{n+1})$ for a new sample is defined as follows. 
% For example, {\color{blue} This is only possible since the groups form a partition, which is fine, but consider how this interacts with the soft clustering approach discussed earlier} we consider a standard group conditional split conformal method where $\mathcal{A}_{CP}$ provides the conformal quantile estimator $q_{\tau^*_{CP}}(\cdot)$ based on the conformal quantile of each identified group. Here the corresponding conformal set $C_{\tau}(X_{n+1})$ for a new sample is defined as follows.

For example, for clustering functions $g_{\tau^*}(\cdot)$ that partition the space with no overlaps, we consider a standard group conditional split conformal method where $\mathcal{A}_{CP}$ provides the conformal quantile estimator $q_{\tau_{cp}}(\cdot)$ based on the conformal quantile of each identified group. The corresponding conformal set $C_{\tau}(X_{n+1})$ for a new sample is defined as
\begin{equation}
    % \begin{array}{l}
             C_{\tau}(X_{n+1}) = \{y \in \mathcal{Y}:     S_f(X_{n+1},y) \le q_{\tau_{cp}}(X_{n+1})\}
\end{equation}
% \begin{equation}
%     \begin{array}{l}
%              C_{\tau}(X_{n+1}) = \{y \in \mathcal{Y}:     S_f(X_{n+1},y) \le q_{\tau_{cp}}(X_{n+1})\},\\
%          % q_{\tau_{cp}}(X) =  \sum_{g \in \mathcal{G}_{\tau^*}}\delta_{[g_{\tau^*}(X) = g]}\mathbb{Q}(1-\alpha,\{s_i\}_{i \in \mathcal{D}^g_{2}})   \\

%          q_{\tau_{cp}}(X_{n+1}) =  \sum_{g \in \mathcal{G}_{\tau^*}}\mathbf{1}{[g_{\tau^*}(X_{n+1}) = g]}\textsc{Q}_{1-\alpha}(\{s_i\}_{i \in \mathcal{D}^g_{2}})   \\
%          % \mathcal{D}^g_{2} = \{(x_i,s_i):g_{\tau^*}(x_i) = g \}_{i \in\mathcal{D}_{2}  }.
%     \end{array}
% \end{equation}

where the conformal quantile function $q_{\tau_{cp}}(\cdot)$ is
\begin{equation}
\small
q_{\tau_{cp}}(X_{n+1}) = \mathbb{Q}_{1-\alpha}\Big(\sum^n_{i=1} \frac{\mathbf{1}[g_{n+1}=g_i]}{n_{g_i}+1}\delta_{s_i} + \frac{1}{n_{g_{n+1}}+1} \delta_{\infty}\Big),
\end{equation}
% where $\mathcal{D}^g_{2} = \{(x_i,s_i):g_{\tau^*}(x_i) = g \}_{i \in\mathcal{D}_{2}  }$, or equivalently
% \begin{equation}
    % \begin{array}{l}
    % \small
    %      % q_{\tau_{cp}}(X) =  \sum_{g \in \mathcal{G}_{\tau^*}}\delta_{[g_{\tau^*}(X) = g]}\mathbb{Q}(1-\alpha,\{s_i\}_{i \in \mathcal{D}^g_{2}})   \\
    %      % q_{\tau_{cp}}(X) = \mathbb{Q}_{1-\alpha}(\sum^n_{i=1} \frac{\mathbf{1}[g_{\tau^*}(X)=g_i)]}{n_{g_i}+1}\delta_{s_i} + \frac{1}{n_{g_{\tau^*}(X)}+1} \delta_{\infty}) 
    %      q_{\tau_{cp}}(X_{n+1}) = \mathbb{Q}_{1-\alpha}(\sum^n_{i=1} \frac{\mathbf{1}[g_{n+1}=g_i)]}{n_{g_i}+1}\delta_{s_i} + \frac{1}{n_{g_{n+1}}+1} \delta_{\infty}) 
         % \mathcal{D}^g_{2} = \{(x_i,s_i):g_{\tau^*}(x_i) = g \}_{i \in\mathcal{D}_{2}  }.
    % \end{array}
% \end{equation}
with $g_i = g_{\tau^*}(x_i)$ and $n_{g_i}$\footnote{$n_{g_i} = | \{j \in \mathcal{D}_2: g_{\tau^*}(x_j)=g_i\} |$.} the number of samples of group $g_i$ in dataset $\mathcal{D}_2$, $\forall i \in [n+1]$.
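
As a small worked example with hypothetical values, and assuming $\mathbb{Q}_{1-\alpha}$ denotes the usual generalized inverse of the cumulative distribution, suppose the group of $X_{n+1}$ contains $n_{g_{n+1}} = 4$ calibration scores, $\{0.2, 0.5, 0.7, 1.1\}$. The distribution inside $\mathbb{Q}_{1-\alpha}$ then places mass $1/5$ on each of these scores and mass $1/5$ on $\infty$, so its cumulative mass at the sorted scores is $0.2, 0.4, 0.6, 0.8$. For $\alpha = 0.2$, the smallest value with cumulative mass at least $0.8$ is $q_{\tau_{cp}}(X_{n+1}) = 1.1$, i.e., the $\lceil (1-\alpha)(n_{g_{n+1}}+1) \rceil = 4$-th smallest score. For $\alpha = 0.1$, the level $0.9$ is only reached at the point mass at $\infty$, so $q_{\tau_{cp}}(X_{n+1}) = \infty$ and the conformal set is all of $\mathcal{Y}$; very small groups therefore lead to uninformative sets.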

%= |\{ x_j \in \mathcal{D}_2 : g_{\tau^*}(x_j) = g_i\}|$   

% \footnote{    Note that    $ q_{\tau_{cp}}(X) = \mathbb{Q}_{1-\alpha}(\sum^n_{i=1} \frac{\mathbf{1}[g_{\tau^*}(X)=g_i)]}{n_{g_i}+1}\delta_{s_i} + \frac{1}{n_{g_{\tau^*}(X)}+1} \delta_{\infty}) $}.
% \begin{equation}
%     \begin{array}{l}
%              % C_{\tau}(X) = \{y \in \mathcal{Y}:     S_f(X,y) \le q_{\tau_{cp}}(X)\},\\
%          q_{\tau_{cp}}(X) =  \sum_{g \in \mathcal{G}_{\tau^*}}\delta_{[g_{\tau^*}(X) = g]}\mathbb{Q}(1-\alpha,\{s_i\}_{i \in \mathcal{D}^g_{2}})   \\
        
%          \mathcal{D}^g_{2} = \{(x_i,s_i):g_{\tau^*}(x_i) = g \}_{i \in\mathcal{D}_{2}  }.
          
%     \end{array}
% \end{equation}
Moreover, for each identified group $g \in \mathcal{G}_{\tau^*}$ the coverage guarantees become 
\begin{equation}
\label{eq:group_guarantees}
\begin{array}{l}
     1- \alpha \le P\Big(Y_{n+1} \in C_{\tau}(X_{n+1})| g_{\tau^*}(X_{n+1}) = g \Big)  \\
     \hspace{0.5in} \le 1- \alpha + \frac{1}{n_{g}+1}.
\end{array}
\end{equation}
Note that the upper bound depends on the number of samples $n_g$ of group $g$ in the calibration set.
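
For instance, with illustrative numbers and $\alpha = 0.1$, a group with $n_g = 99$ calibration samples has coverage between $0.90$ and $0.90 + 1/100 = 0.91$, whereas for a group with only $n_g = 9$ samples the upper bound $0.90 + 1/10 = 1$ is vacuous.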


\begin{algorithm}[h!]
\caption{Region Identification Meta-Algorithm}\label{alg:meta_algo}
\begin{algorithmic}
\REQUIRE Two disjoint i.i.d. datasets  $\mathcal{D}_1, \mathcal{D}_2$ containing input samples and their corresponding non-conformity scores. 

$M_{1-\alpha}(\cdot,\cdot):\mathcal{D} \times \Theta \rightarrow \mathcal{T}$, a solver for the regularized objective defining $\tau_{\theta}$ in Eq.~\ref{eq:gen_objective}.

$\mathcal{A}_{CP}:\mathcal{D} \times \mathcal{G}^{|\mathcal{D}|} \rightarrow \mathbb{R}^{|\mathcal{G}|}$ group-conditional conformal prediction mechanism

$\theta_0 \in \Theta$ weakest acceptable regularization parameter, $\Delta_{\theta}$ regularization parameter step size.
% $\prod_{\Theta}(\cdot)$ projection operator on the regularization parameter space
% $$\epsilon_g(f) \coloneqq \frac{1}{|D^a_g|} \sum_{(x,y,g) \in D^A_g} \epsilon(f(x),y)$$
% $$\epsilon_g(f) \coloneqq \mathbb{E}_{\mathcal{D}_g^{a}} [\epsilon(f(X),Y)].$$
% Also, let $M(Q)$ be independent of $D^a$.
% \STATE $\tau_{\theta_0} = M(\mathcal{D},\theta_0)$
\STATE \# Region Identification 
\STATE $\textsc{MCR}^* \leftarrow \infty$ \quad \# Initialize the best MCR
% \STATE Split $\mathcal{D}_1$ into $\mathcal{D}^a$ and $\mathcal{D}^b$  {\color{red} ADD K FOLD}
\FOR{$t=0, \dots, T$}
\STATE $\textsc{MCR}_t \leftarrow \{\}$ \quad \# Initialize the set of fold-wise MCR values for step $t$
\STATE \# K-fold Cross validation
\FOR{$k=1,\dots,K$} 
\STATE Split $\mathcal{D}_1$ randomly into $\mathcal{D}^{a,k}$ and $\mathcal{D}^{b,k}$  
\STATE $\tau_{\theta_t} \leftarrow M_{1-\alpha}(\mathcal{D}^{b,k},\theta_{t})$
\STATE $\textsc{MCR}_t \leftarrow \textsc{MCR}_t \cup \{MCR_{\alpha}(\tau_{\theta_t};\mathcal{D}^{a,k})\}$
\ENDFOR
\STATE $\textsc{sMCR} \leftarrow \mathrm{mean}(\textsc{MCR}_t) + \mathrm{std}(\textsc{MCR}_t)$
\IF{$\textsc{sMCR} < \textsc{MCR}^*$}
\STATE $\textsc{MCR}^* \leftarrow \textsc{sMCR}$,  $\theta^* \leftarrow \theta_t $
\ENDIF
\STATE $\theta_{t+1} \leftarrow \prod_{\Theta}(\theta_{t} + \Delta_{\theta})$
\ENDFOR
\STATE $\tau^* \leftarrow  M_{1-\alpha}(\mathcal{D}_1,\theta^*)$
\STATE \# Conformalize group conditional quantile predictor
\STATE $q_{\tau_{cp}} \leftarrow \mathcal{A}_{CP}(\mathcal{D}_{2},\{g_{\tau^*}(x_i)\}_{i \in \mathcal{D}_{2}})$
\STATE \OUTPUT $\tau_{cp}=(q_{\tau_{cp}},g_{\tau^*})$
\end{algorithmic}
\end{algorithm}





\subsection{Learning Decision-Tree-Based Regions}
\label{subsec:Learning DTrees}

Decision trees are a natural candidate for learning partition functions because they are inherently interpretable, especially at low tree depths. We need a solver $M_{1-\alpha}$ that, given a dataset and a regularization parameter, returns a tree minimizing the average $1-\alpha$ pinball loss as in Eq.~\ref{eq:gen_objective}. The difficulty is that, as far as we know, available decision-tree regression solvers do not support the pinball loss. We therefore first train a surrogate model $h^* \in \mathcal{H}$, where $\mathcal{H}$ is a family of models for which pinball-loss solvers are available. We then fit the decision tree to the output of $h^*$ by minimizing the mean squared error against the surrogate model's predicted (input-dependent) quantile. This two-step procedure is summarized in the following objective:
\begin{equation}
    \begin{array}{c}
          \tau_{\theta} \in \arg\min_{\tau \in \mathcal{T}}\mathbb{E}_{\mathcal{D}^b}\big[(q_{\tau}(X)-h^*(X))^2 \big] + \mathcal{R}_{\theta}(\tau), \\
          \mathrm{s.t.} \quad h^* \in  \arg\min_{h \in \mathcal{H}}\mathbb{E}_{\mathcal{D}^b}\big[\ell_{1-\alpha}(h(X),S) \big].
    \end{array}
\end{equation}
% Here $h^*$ is the surrogate model and $\mathcal{H}$ a family of models for which we have access to a solver that optimizes the pinball loss (as stated in the second equation). 
In our experiments, we take $\mathcal{H}$ to be a family of gradient-boosted decision trees that supports the pinball loss~\cite{ke2017lightgbm}, and we use hyperparameter optimization~\cite{akiba2019optuna} to limit overfitting of the surrogate model $h^*$.
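
To make the role of the squared-error fit concrete, the following short derivation shows which threshold the tree assigns to each region. It is a sketch under two simplifying assumptions: the regularizer $\mathcal{R}_{\theta}$ is ignored, and the partition $g_{\tau}$ is held fixed. Writing $q_g$ for the constant value of $q_{\tau}$ on group $g$ (a symbol introduced here only for illustration), the first-order condition reads
\[
\frac{\partial}{\partial q_g}\,\mathbb{E}_{\mathcal{D}^b}\big[(q_g - h^*(X))^2\,\mathbf{1}[g_{\tau}(X)=g]\big] = 2\,\mathbb{E}_{\mathcal{D}^b}\big[(q_g - h^*(X))\,\mathbf{1}[g_{\tau}(X)=g]\big] = 0,
\]
so that
\[
q_g = \frac{\mathbb{E}_{\mathcal{D}^b}\big[h^*(X)\,\mathbf{1}[g_{\tau}(X)=g]\big]}{\mathbb{E}_{\mathcal{D}^b}\big[\mathbf{1}[g_{\tau}(X)=g]\big]} = \mathbb{E}_{\mathcal{D}^b}\big[h^*(X) \mid g_{\tau}(X)=g\big].
\]
Under these assumptions, each region's threshold is the average of the surrogate's predicted quantile over that region, which need not coincide with the region's empirical $1-\alpha$ score quantile. This is consistent with the per-group thresholds being recomputed on $\mathcal{D}_2$ by the group-conditional conformal step $\mathcal{A}_{CP}$.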




