\section{Introduction}

In recent years, many explanation methods have been developed for explaining machine learning models, with a strong focus on local analysis, i.e., generating explanations for individual predictions \citep[see][for a survey]{molnar2022}. Among this plethora of methods, one of the most prominent and active techniques is Counterfactual Explanations \citep{Wachter2017CounterfactualEW}. Unlike popular local attribution methods, e.g., SHAP \citep{lundberg2020local2global} and LIME \citep{ribeiro2016why}, which assign an importance score to each feature, Counterfactual Explanations (CE) describe the smallest modification to the feature values that changes the prediction to a desired target. Although CE are intuitive and user-friendly, since they give recourse in some scenarios (e.g., loan applications), they have many shortcomings in practice. Indeed, most counterfactual methods rely on gradient-based algorithms or heuristic approaches \citep{survey_counterfactual}, and thus can fail to identify the most natural explanations and lack guarantees. Most algorithms either do not guarantee sparse counterfactuals (changes in the smallest number of features) or do not generate in-distribution samples \citep[see][for surveys of counterfactual methods]{counterfactual_r1, CHOU202259}. Although some works \citep{optimalce_vidal, face_counterfactual, prototype_basedce} try to solve the plausibility/sparsity problem, the suggested solutions are not entirely satisfactory. \\ 
In another direction, many papers \citep{dice, Karimi2020ModelAgnosticCE, diverce_ce} encourage the generation of diverse counterfactuals in order to find actionable recourse \citep{Ustun2019ActionableRI}. Actionability is a vital desideratum, as some features may be non-actionable, and generating many counterfactuals increases the chance of getting actionable recourse. However, the diversity of CE makes the explanations less intelligible, and the synthesis of multiple CE, or of local explanations in general, is yet to be comprehensively solved \citep{rethinkinxai}. In addition, \cite{himanoisycounterfactuals} recently highlighted a new problem of local CE called \textit{noisy responses to prescribed recourses}. Indeed, in real-world scenarios, some individuals may not be able to implement the prescribed recourses exactly, and the authors show that most CE methods fail in this noisy environment. Therefore, we propose to reverse the usual way of explaining with counterfactuals by computing \textit{Counterfactual Rules}. We introduce a new line of counterfactuals: we build interpretable policies for changing a decision with a given probability, which ensures the stability of the deduced recourse. These policies are optimal (in sparsity) and faithful to the data distribution. Their computation comes with statistical guarantees, as they use a consistent estimator of the conditional distribution. Our proposal is to find a general policy or rule that permits changing the decision while fixing some features, instead of generating many counterfactual samples. One of the main challenges is to identify the (minimal) set of features that provides the most promising directions for changing the decision to the desired output. We also show that this approach can be extended to find a collection of regional counterfactuals, so that we obtain a global counterfactual policy for analyzing a model. An example of the counterfactual rules that we introduce is given in Figure \ref{fig:oce}.
\begin{figure}[ht]
    \centering
    \includegraphics[scale=0.4]{figures/lcr_rcl_illustration.png}
    \caption{Illustration of the local and regional Counterfactual Rules that we introduce, on a dataset with four variables: Age, Salary, Sex, and HoursPerWeek. A Counterfactual Rule defines intervals on the minimal subset of features needed to change the decision: the model prediction for a single instance in the local counterfactual rule, or the decision of a rule that applies to a sub-population in the regional counterfactual rule. The proposed rules for changing the decision are shown in blue.}
    \label{fig:oce}
\end{figure}


\section{Motivation and Related Work}

Most methods that propose Counterfactual Explanations follow the approach of the seminal work of \cite{wachter2017counterfactual}: the counterfactuals are generated by optimizing a cost, but this procedure does not directly account for the plausibility of the counterfactual examples \citep[see][for a classification of CE methods]{counterfactual_r1}. Indeed, a major shortcoming is that the adverse decision needed for obtaining the counterfactual is not designed to be feasible or representative of the underlying data distribution. However, some recent studies have proposed ad-hoc plausibility constraints in the optimization, using for instance an outlier score \citep{dace}, an Isolation Forest \citep{optimalce_vidal}, or a density-weighted metric \citep{face_counterfactual} to generate in-distribution samples. In another direction, \cite{prototype_basedce} propose to use an autoencoder that penalizes out-of-distribution candidates. 
Instead of relying on ad-hoc constraints, we propose CE that give plausible explanations by design. Indeed, for each observation, we identify the variables and associated ranges of values that have the highest probability of changing the prediction. We can compute this probability with a consistent estimator of the conditional distribution $P(Y | \boldsymbol{X}_S)$. As a consequence, the sparsity of the counterfactuals is not encouraged indirectly by adding a penalty term ($\ell_0$ or $\ell_1$), as in existing works \citep{dice}. 
Our approach is inspired by the concept of \textit{Same Decision Probability (SDP)} \citep{Chen2012TheSP}, which can be used to identify the smallest subset of features that guarantees (with a given probability) the stability of a prediction. Such a minimal subset is called a \textit{Sufficient Explanation}. In \citep{amoukou2021consistent}, it was shown that the \textit{SDP} and the \textit{Sufficient Explanations} can be estimated and computed efficiently for identifying important local variables in any classification or regression model. For counterfactuals, we are interested in the dual set: we want the minimal subset of features that have a high probability of changing the decision (when the other features are fixed).
Another limitation of current CE is their local nature and the multiplicity of the explanations produced. While some papers \citep{dice, Karimi2020ModelAgnosticCE, diverce_ce} promote the generation of diverse counterfactual samples to ensure actionable recourse, such diverse explanations should be summarized to be intelligible \citep{rethinkinxai}, and the compilation of local explanations is often a very difficult problem. To address this problem, we do not generate counterfactual samples; instead, we build a rule, called a \textit{Counterfactual Rule} (CR), from which we can derive counterfactuals. Contrary to classic CE, which give the nearest instances with a desired output, we find the most effective rule for each observation (or group of similar observations) that changes the prediction to the desired target. 
This local rule easily aggregates similar counterfactuals. For example, if  $\boldsymbol{x} = \{ \texttt{Age=20, Salary=35k, HoursWeek=25h, Sex=M}, \dots \}$ with  \texttt{Loan=False}, fixing the variables \texttt{Age} and \texttt{Sex} and changing \texttt{Salary} and \texttt{HoursWeek} changes the decision. Therefore, instead of giving multiple combinations of \texttt{Salary} and \texttt{HoursWeek} (e.g., 35k and 40h or 40k and 55h, \dots) that result in many instances, the counterfactual rule gives the range of values: \texttt{IF HoursWeek $\in$ [35h, 50h], Salary $\in$ [40k, 50k], and the} \textbf{remaining features are fixed} \texttt{THEN Loan=True}. 
It can be extended at a regional scale, e.g., given a rule $\textbf{R} = \{\texttt{IF Salary} \in \texttt{[20k, 35k],  Age} \in \texttt{[20, 80] THEN Loan=False}\}$, the regional Counterfactual Rule (CR) could be $\{ \texttt{\textbf{IF } Salary} \in \texttt{[40k, 50k],} \texttt{ HoursWeek} \in \texttt{[35h, 50h] and the} \textbf{ remaining rules are fixed}  \texttt{ THEN Loan=True}\}$. The main difference between a local and a regional CR is that the Local-CR explains a single instance by fixing the remaining feature values (those not used in the CR), while a Regional-CR keeps the remaining variables (those not used in the Regional-CR) within given intervals. Moreover, by giving ranges of values that guarantee a high probability of changing the decision, we partly answer the problem of \textit{noisy responses to prescribed recourses} \citep{himanoisycounterfactuals}, so long as the perturbations stay within our ranges.

Although the \textit{Local Counterfactual Rule} is new, the \textit{Regional Counterfactual Rule} can be related to some recent works. Indeed, \cite{rawal2020beyond} proposed Actionable Recourse Summaries (AReS), a framework that constructs global counterfactual recourses in order to gain global insight into the model and detect unfair behavior. While AReS is similar to the Regional Counterfactual Rule, we emphasize some significant differences. Our methods can address regression problems and deal with continuous features. In contrast, AReS needs to discretize the continuous features, inducing a trade-off between speed and performance, as noticed by \cite{globalce}: too few bins result in unrealistic recourse, while too many bins result in excessive computation time. In addition, AReS uses a greedy heuristic search to find global recourse, which might produce sub-optimal recourse. Our approach overcomes these two limitations because the consistency of our counterfactuals is controlled by an estimate of the probability of changing the decision, and because we favor changes to a minimum number of features.
Another global CE framework has been introduced in \citep{cet4} to ensure transparency: the Counterfactual Explanation Tree (CET) partitions the input space with a decision tree and assigns an appropriate action for changing the decision of each subspace. Therefore, it gives a unique action for changing the decision of multiple instances. In our case, we offer more flexibility in the counterfactual explanations because we provide a range of possible values that guarantee a change with a given probability. 
In our approach, we do not make any assumption about the cost of changing the feature nor the causal structure. If we have such information, then we can add it as additional post-processing such that it can be made more explicit and more transparent for the final user as required for trustworthy AI.




\section{Minimal Counterfactual Rules}
We assume that we have an i.i.d.\ sample $\mathcal{D}_n = \{(\boldsymbol{X}_i,Y_i)\}_{i=1,\dots,n}$ such that $(\X, Y) \sim P_{(\X, Y)}$, where $\X \in \mathcal{X}$ (typically $\mathcal{X}=\mathbb{R}^p$) and $Y \in \mathcal{Y}$. The output space $\mathcal{Y}$ can be discrete or continuous. We want to explain the predictor $f:\mathbb{R}^p \to \mathcal{Y}$, which has been learned from the dataset $\mathcal{D}_n$. We use uppercase letters for random variables and lowercase letters for their value assignments. For a given subset $S \subset [p]$, $\XS = (X_i)_{i \in S}$ denotes a subgroup of features, $\bar{S} = [p] \setminus S$ denotes the complementary subset, and we write $\x=(\xs,\xsb)$ (with some abuse of notation).


For an observation $(\x,y=f(\x))$,  we have a target set $\YSt \subset \mathcal{Y}$ such that $y\notin \YSt$. In the simple case of a classification problem, $\YSt = \{ y^\star\}$ is a singleton, where $y^\star\in \mathcal{Y}$ is different from $y$. Contrary to standard approaches, our definition of the counterfactual also covers the regression case by considering $\YSt = [a,b]\subset \mathbb{R}$; our definitions and computations of counterfactuals are the same for both classification and regression. We recall that the classic CE problem (defined only for classification) is to find a function $\A: \mathcal{X} \to \mathcal{X}$ such that, for all observations $\x \in \mathcal{X}$ with $f(\x)\neq y^\star$, we have $f(\A(\x))=y^\star$. With standard CE, the function is defined point-wise by solving an optimization program. Most often, $\A(\cdot)$ is not a proper function, as $\A(\x)$ may in fact be a collection of (random) values $\{\x_1^\star,\dots,\x_p^\star\}$. A more recent point of view was proposed by \cite{cet4}, who define $\A$ as a decision tree where, in each leaf $L$, the best perturbation $a_L$ is predicted and added to all the instances $\x \in L$. \\
Our approach is hybrid, because we do not propose a single action for each subspace of $\mathcal{X}$ or sub-group of the population; instead, we give sets of possible perturbations. Indeed, a \emph{Local Counterfactual Rule} (Local-CR) for $\YSt$ and an observation $\x$ (with $f(\x)\notin \YSt$) is a rectangle $C_{S}(\boldsymbol{x};\YSt) = \prod_{i\in S} [a_i, b_i]$, $a_i, b_i \in \overline{\mathbb{R}}$, such that for every perturbation of $\x=\left(\xs,\xsb \right)$ obtained as $\boldsymbol{x}^\star = \left(\zs,\xsb \right)$ with $\zs \in C_{S}(\x;\YSt)$ and $\boldsymbol{x}^\star$ an in-distribution sample, $f\left( \boldsymbol{x}^\star\right)$ is in $\YSt$ with high probability.\\
Similarly, a \emph{Regional Counterfactual Rule} (Regional-CR) $C_S(\boldsymbol{R}; \YSt)$ is defined for $\YSt$ and a rectangle $\boldsymbol{R}=\prod_{i=1}^{p} [a_i, b_i]$, $a_i, b_i \in \overline{\mathbb{R}}$, if for all observations $\x=(\xs,\xsb) \in \boldsymbol{R}$, the perturbations obtained as $\boldsymbol{x}^\star = (\zs,\xsb)$ with $\zs \in C_S(\boldsymbol{R};\YSt)$ and $\boldsymbol{x}^\star$ an in-distribution sample are such that
$f\left( \boldsymbol{x}^\star\right)$ is in $\YSt$ with high probability.\\
We build such rectangles sequentially. First, we find the directions $S \subset [p]$ that offer the highest probability of change. Then, we find the best intervals $[a_i, b_i], i \in S$, that change the decision to the desired target. A central tool in this approach is the Counterfactual Decision Probability.

\begin{definition}
\label{def:cdp}\textbf{Counterfactual Decision Probability (CDP).} The Counterfactual Decision Probability of the subset $S\subset\left\llbracket 1,p\right\rrbracket $,
w.r.t.\ $\boldsymbol{x}=\left(\boldsymbol{x}_{S},\boldsymbol{x}_{\bar{S}}\right)$ and the desired target $\YSt$ (s.t. $f(\x)\notin \YSt$), is
    \[CDP_{S}\left(\YSt; \boldsymbol{x}\right)=P\left(f(\X) \in  \YSt\left|\boldsymbol{X}_{\bar{S}}=\boldsymbol{x}_{\bar{S}}\right.\right).\]
\end{definition} 
The $CDP$ of the subset $S$ is the probability that the decision changes to the desired target $\YSt$ when sampling the features $\XS$ given $\boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$. It is related to the Same Decision Probability  $SDP_{S}(\mathscr{Y}; \boldsymbol{x}) = P\left(f(\X) \in \mathscr{Y} \vert \XS=\xs \right)$ used in \citep{amoukou2021consistent} for solving the dual problem of selecting the most important local variables for obtaining and maintaining the decision $f(\x) \in \mathscr{Y}$ (where $\mathscr{Y}\subset \mathcal{Y}$). In that setting, the minimal set $S$ whose $SDP_S$ reaches a given level is called the Minimal Sufficient Explanation. Indeed, we have $CDP_S(\YSt; \boldsymbol{x}) = SDP_{\bar{S}}(\YSt; \boldsymbol{x})$. The computation of these probabilities is challenging and is discussed in Section 4. 
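As a brief illustration of this duality (a sketch under an added assumption, not a statement taken from the cited works), consider binary classification with $\mathcal{Y} = \{y, y^\star\}$, $f(\x) = y$ and $\YSt = \{y^\star\}$. Since the events $\{f(\X) = y\}$ and $\{f(\X) = y^\star\}$ are complementary, conditioning on $\XSb = \xsb$ gives
\[
CDP_{S}\left(\YSt; \x\right) = SDP_{\bar{S}}\left(\YSt; \x\right) = 1 - SDP_{\bar{S}}\left(\{y\}; \x\right),
\]
so, in this special case, a subset $S$ has a high probability of changing the decision exactly when fixing its complement $\bar{S}$ gives a low probability of maintaining it.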
We now focus on the minimal subset of features $S$ such that the model makes the desired decision with a given probability $\pi$.
\begin{definition}  \label{def:minimal_countset}(\textbf{Minimal Divergent Explanations}). Given an instance $\boldsymbol{x}$ and a desired target $\YSt$, $S$ is a Divergent Explanation for probability $\pi>0$ if $CDP_{S}\left(\YSt;\boldsymbol{x}\right)\geq\pi$ and no proper subset $Z$ of $S$ satisfies $CDP_{Z}\left(\YSt;\boldsymbol{x}\right)\geq\pi$. 
Hence, a Minimal Divergent Explanation is a Divergent Explanation with minimal size.
\end{definition}
A Minimal Divergent Explanation is not necessarily unique: several subsets of the same minimal size can satisfy the condition. Note that the probability $\pi$ represents the minimum level required for a set to be chosen for generating counterfactuals; its value should be as high as possible and depends on the use case.
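To make Definition \ref{def:minimal_countset} concrete, consider a purely hypothetical example with $p=2$ features and level $\pi = 0.75$; the numerical values below are invented for illustration only. Note first that $CDP_{\emptyset}(\YSt;\x) = P(f(\X)\in\YSt \mid \X = \x) = 0$, since $f(\x)\notin\YSt$, so the empty set never reaches the level $\pi$. Suppose that
\[
CDP_{\{1\}}(\YSt;\x) = 0.30, \qquad CDP_{\{2\}}(\YSt;\x) = 0.80, \qquad CDP_{\{1,2\}}(\YSt;\x) = 0.90 .
\]
Then $S=\{2\}$ is a Divergent Explanation, because $0.80 \geq \pi$ and its only proper subset, $\emptyset$, has $CDP$ equal to $0$. The set $\{1,2\}$ also reaches the level $\pi$, but it is not a Divergent Explanation, because its proper subset $\{2\}$ already does. The Minimal Divergent Explanation is therefore $\{2\}$: changing feature $2$ alone, with feature $1$ fixed, is enough.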
We now have enough material to define our main criterion for building a Local Counterfactual Rule (Local-CR): 
\begin{definition}\label{def:local_counterfactual_rule}  (\textbf{Local Counterfactual Rule}). Given an instance $\boldsymbol{x}$, a desired target $\YSt \not\owns f(\x)$ , a Minimal Divergent Explanation $S$, the rectangle 
$C_{S}(\boldsymbol{x}; \YSt) = \prod_{i\in S} [a_i, b_i], a_i, b_i \in \overline{\mathbb{R}}$ is a Local Counterfactual Rule with probability $\pi_C$ if
\begin{align} \label{eq:crp_instance}
             CRP_S(\YSt; \x, C_S(\x;\YSt)) \triangleq P( f(\X) \in \YSt \; | \boldsymbol{X}_S \in C_S(\boldsymbol{x};\YSt), \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}) \geq \pi_C.
\end{align}
The $CRP_S$ is the Counterfactual Rule Probability.
\end{definition}
The higher the probability $\pi_C$, the more relevant the rule $C_S(\x; \YSt)$ is for this instance. Given a set $S$, we seek the maximal rectangle in the direction $S$ satisfying Eq. \ref{eq:crp_instance}.
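
For intuition, the criterion in Eq. \ref{eq:crp_instance} can be unpacked with the elementary definition of conditional probability. Writing $C = C_S(\x;\YSt)$ and assuming that $P(\XS \in C \mid \XSb = \xsb) > 0$ (an assumption added here so that the ratio is well defined),
\[
P\left( f(\X) \in \YSt \mid \XS \in C,\ \XSb = \xsb \right) = \frac{P\left(f(\X)\in\YSt,\ \XS \in C \mid \XSb = \xsb\right)}{P\left(\XS \in C \mid \XSb = \xsb\right)} .
\]
The denominator makes explicit why a rule must be in-distribution: a rectangle $C$ that carries no conditional mass given $\xsb$ cannot be assessed by this criterion. Enlarging $C$ never decreases the denominator, so the maximal rectangle is the largest one for which the ratio stays above $\pi_C$.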

In practice, we observe that the Local-CR $C_{S}(\cdot;\YSt)$ for neighbors $\x,\x'$ are often quite close, because the Minimal Divergent Explanations are similar and the corresponding rectangles often overlap. This motivates a generalisation of the Local-CR to hyperrectangles $\boldsymbol{R} = \prod_{i=1}^{p} [a_i, b_i]$, $a_i, b_i \in \overline{\mathbb{R}}$, regrouping similar observations. We denote by $\text{supp}(\boldsymbol{R}) = \{i : [a_i, b_i] \neq \overline{\mathbb{R}}\}$ the support of the rectangle, and we extend the Local-CR to Regional Counterfactual Rules (Regional-CR). To do so, we denote by $\boldsymbol{R}_{\bar{S}} = \prod_{i \in \bar{S}} [a_i, b_i]$ the rectangle with the intervals of $\boldsymbol{R}$ in $\text{supp}(\boldsymbol{R}) \cap \bar{S}$, and we define the corresponding Counterfactual Decision Probability (Definition \ref{def:cdp}) for rule $\boldsymbol{R}$ and subset $S$ as $CDP_S(\YSt; \boldsymbol{R}) = P\left(f(\X) \in  \YSt \left|\boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}}\right.\right)$. Therefore, we can also compute the Minimal Divergent Explanation for rule $\boldsymbol{R}$ using Definition \ref{def:minimal_countset} with the CDP for rules. 








\begin{definition}\label{def:regional_rule}  (\textbf{Regional Counterfactual Rule}). Given any rectangle $\boldsymbol{R}$, a desired target $\YSt$, and a Minimal Divergent Explanation $S$ of $\boldsymbol{R}$, the rectangle 
$C_S(\boldsymbol{R}; \YSt) = \prod_{i\in S} [a_i, b_i]$ is a Regional Counterfactual Rule with probability $\pi_C$ if 
\begin{align} \label{eq:crp_rule}
            CRP_S(\YSt; \boldsymbol{R}, C_S(\boldsymbol{R}; \YSt)) \triangleq P( f(\X) \in \YSt \; | \boldsymbol{X}_S \in C_S(\boldsymbol{R};\YSt), \XSb \in \boldsymbol{R}_{\bar{S}} )\geq \pi_C.
\end{align}
$CRP_S(\YSt; \boldsymbol{R}, C_S(\boldsymbol{R}; \YSt))$ is the corresponding Counterfactual Rule Probability for rule $\boldsymbol{R}$. 
\end{definition}

\paragraph{Remarks:}  Local-CR and Regional-CR differ slightly: for the local rule, we condition on $\boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$ in Eq. \ref{eq:crp_instance}, while for the regional rule, we condition on $\boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}}$. For computing a Regional-CR, we can start from a rectangle generated by any method, such as \citep{bayesianRuleListRudin, OptimalDecisionTreeRudin}. The only condition is that it contains a homogeneous group, i.e., observations with almost the same output. However, by default we use as initial rules the Sufficient Rules derived in \citep{amoukou2021consistent}, as they handle regression problems. The Sufficient Rules are minimal-support rectangles defined for a given output $\mathscr{Y}$ as $C_S(\mathscr{Y}) = \prod_{i\in S} [a_i,b_i]$ such that, for all $\x \in \mathcal{X}$ with $\xs \in C_S(\mathscr{Y})$, $P(f(\X) \in \mathscr{Y} \vert \XS = \xs) \geq \pi$.  




\section{Estimation of the $CDP$ and $CRP$}
In order to compute the probabilities $CDP_S$ and $CRP_S$ for any $S$, we use a dedicated Random Forest (RF) $m_{k, n}$ that learns the model $f$ to be explained. Indeed, the conditional probabilities $CDP_S$ and $CRP_S$ can be easily computed from a RF by combining the Projected Forest algorithm \citep{benard2021shaff} and the Quantile Regression Forest \citep{meinshausen2006quantile}; hence, we can consistently estimate the probabilities $CDP_S(\YSt; \boldsymbol{x})$. We adapt the approach used in \citep{amoukou2021consistent} and, for the sake of completeness, recall the computation of the estimate of $SDP_S$. 
\subsection{Projected Forest and $CDP_S$}

The estimator of the $SDP_S$ is built upon a learned Random Forest \citep{breiman1984classification}. A Random Forest (RF) is a predictor consisting of a collection of $k$ randomized trees \citep[see][for a detailed description of decision trees]{Loh2011ClassificationAR}. For each instance $\boldsymbol{x}$, the predicted value of the $l$-th tree is denoted $m_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)$, where $\Theta_l$ represents the data resampling mechanism in the $l$-th tree and the successive random splitting directions. The trees are then averaged to give the prediction of the forest as: 
\begin{align} 
\label{eq:random_forest_baggin_estimator} \small
    m_{k, n}(\boldsymbol{x}, \Theta_{1:k}, \mathcal{D}_n) = \frac{1}{k} \sum_{l=1}^{k} m_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)
\end{align}
However, the RF can also be viewed as an adaptive nearest neighbor predictor. For every instance $\boldsymbol{x}$, the observations in $\mathcal{D}_n$ are weighted by $w_{n, i}(\boldsymbol{x}; \Theta_{1:k}, \mathcal{D}_n)$, $i=1, \dots, n$. Therefore, the prediction of the RF can be rewritten as\[ 
\small m_{k, n}(\boldsymbol{x}, \Theta_{1:k}, \mathcal{D}_n) = \sum_{i=1}^{n} w_{n, i}(\boldsymbol{x}; \Theta_{1:k}, \mathcal{D}_n) Y_i. \nonumber
\]
This emphasizes the central role played by the weights in the RF algorithm; see \citep{meinshausen2006quantile, amoukou2021consistent} for a detailed description of the weights. The weights therefore naturally give estimators of other quantities, e.g., the cumulative hazard function \citep{ishwaran2008random}, treatment effects \citep{wager2017estimation}, or the conditional density \citep{du2021wasserstein}. For instance, \cite{meinshausen2006quantile} showed that we can use the same weights to estimate the conditional distribution function with the following estimator:
\begin{align} 
    \widehat{F}(y | \boldsymbol{X} = \boldsymbol{x}, \Theta_{1:k}, \mathcal{D}_n) = \sum_{i=1}^{n} w_{n, i}(\boldsymbol{x}; \Theta_{1:k}, \mathcal{D}_n) \mathds{1}_{Y_{i} \leq y} 
    \label{eq:estimator_boostrap_quantile}
\end{align}
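For concreteness, the weights can be written explicitly in the simplified case where bootstrap multiplicities are ignored; this follows the construction of \cite{meinshausen2006quantile}, and the symbols $A_n$ and $N_n$ are notation introduced here for illustration. Let $A_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)$ be the leaf of the $l$-th tree containing $\boldsymbol{x}$ and $N_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)$ the number of training observations in that leaf. Then
\[
w_{n, i}(\boldsymbol{x}; \Theta_{1:k}, \mathcal{D}_n) = \frac{1}{k} \sum_{l=1}^{k} \frac{\mathds{1}_{\boldsymbol{X}_i \in A_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)}}{N_n(\boldsymbol{x}; \Theta_l, \mathcal{D}_n)},
\]
so the weights are non-negative and sum to one over $i=1,\dots,n$. For example, with a single tree ($k=1$) whose leaf containing $\boldsymbol{x}$ holds four training observations, each of these observations receives weight $1/4$ and all others receive weight $0$.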
In another direction, \cite{benard2021shaff} introduced the Projected Forest algorithm \citep{benard2021mda, benard2021shaff}, which aims to estimate $E[Y | \boldsymbol{X}_S]$ by modifying the RF's prediction algorithm.

\paragraph{Projected Forest:}  To estimate $E[Y | \XS = \xs]$ instead of $E[Y | \boldsymbol{X} = \x]$ using a RF, \cite{benard2021interpretable} suggest simply ignoring the splits based on the variables not contained in $S$ in the tree predictions. More formally, this consists of projecting the partition of each tree of the forest onto the subspace spanned by the variables in $S$. The authors also introduced an algorithmic trick that computes the output of the Projected Forest efficiently without modifying the initial tree structures. We drop the observations down the initial trees, ignoring the splits that use a variable not in $S$: when an observation encounters a split involving a variable $i \notin S$, it is sent to both the left and right children nodes. Therefore, each instance falls in multiple terminal leaves of the tree. To compute the prediction at $\xs$, we follow the same procedure and gather the set of terminal leaves where $\xs$ falls. Next, we collect the training observations that belong to every terminal leaf of this collection; in other words, we keep only the observations that fall in the intersection of the leaves where $\xs$ falls. Finally, we average their outputs $Y_i$ to obtain the estimate of $E[Y | \XS = \xs]$. The authors show that this algorithm converges asymptotically to the true projected conditional expectation $E[Y | \XS = \xs]$.


Like the RF, the Projected RF (PRF) also assigns a weight to each observation. The associated PRF is denoted $m_{k, n}^{(\xs)}(\xs) = \sum_{i=1}^{n} w_{n, i}(\xs) Y_i$. Therefore, just as the weights of the original forest were used to estimate the CDF in Eq. \ref{eq:estimator_boostrap_quantile}, \cite{amoukou2021consistent} used the weights of the Projected Forest algorithm to estimate the $SDP$ as $\widehat{SDP}_{S}\left(\mathscr{Y};\boldsymbol{x}\right) = \sum_{i=1}^{n} w_{n, i}(\xs) \mathds{1}_{Y_i \in  \mathscr{Y}}$. The idea is essentially to replace $Y_i$ by $\mathds{1}_{Y_i \in \mathscr{Y}}$ in the Projected Forest equation above. The authors also show that this estimator converges asymptotically to the true $SDP_S$. Since $CDP_S = SDP_{\bar{S}}$, we can estimate the $CDP$ with the following estimator
\begin{equation} \label{eq:CDP_estimator}
    \widehat{CDP}_{S}\left(\YSt;\boldsymbol{x}\right) = \sum_{i=1}^{n} w_{n, i}(\xsb) \mathds{1}_{Y_i \in \YSt}.
\end{equation}
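As a toy illustration of Eq. \ref{eq:CDP_estimator} (the numbers are hypothetical and assume a forest reduced to a single tree), suppose that the projected cell containing $\xsb$ holds five training observations, three of which satisfy $Y_i \in \YSt$. Each of the five observations then has weight $w_{n,i}(\xsb) = 1/5$, every other observation has weight $0$, and
\[
\widehat{CDP}_{S}\left(\YSt;\boldsymbol{x}\right) = 3 \times \frac{1}{5} = 0.6 .
\]
With several trees, the same computation is carried out in each tree and the per-tree weights are averaged, in line with the averaging of Eq. \ref{eq:random_forest_baggin_estimator}.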
\paragraph{Remarks:} Note that we only give the estimator of the $CDP_S$ of an instance $\x$. The estimator of the $CDP_S$ of a rule $\boldsymbol{R}$ is discussed in the next section, as it is related to the estimator of the $CRP_S$.

\subsection{Regional RF and $CRP_S$}
In this section, we focus on the estimation of the $CRP_S(\YSt; \x, C_S(\x;\YSt)) = P(f(\X) \in  \YSt \; | \boldsymbol{X}_S \in C_S(\boldsymbol{x}; \YSt), \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}})$ and $CRP_S(\YSt; \boldsymbol{R}, C_S(\boldsymbol{R};\YSt)) = P(f(\X) \in \YSt \; | \boldsymbol{X}_S \in C_S(\boldsymbol{R};\YSt), \boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}})$. For simplicity, we drop the dependence of the rectangles on $\YSt$ from the notation. Based on the previous section, we already know that the estimators using the RF take the form $\widehat{CRP}_{S}\left(\YSt; \boldsymbol{x}, C_S(\boldsymbol{x})\right) = \sum_{i=1}^{n} w_{n, i}(\x) \mathds{1}_{Y_i \in \YSt}$, so we only need to find the right weighting. The main challenge is that we have a condition based on a region, e.g., $\XS \in C_S(\boldsymbol{x})$ or $\boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}}$ (regional-based), instead of a condition of type $\XS = \xs$ (fixed value-based) as usual. To handle this, we introduce a natural generalization of the RF algorithm that makes predictions when the conditions are both regional-based and fixed value-based. The case where there are only regional-based conditions then follows naturally. 

\paragraph{Regional RF to estimate $CRP_S(\YSt,\x, C_S(\x)) = P(f(\X) \in \YSt \; | \boldsymbol{X}_S \in C_S(\boldsymbol{x}), \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}})$:} The algorithm is based on a slight modification of RF.
It works as follows: we drop the observations down the initial trees. If a split uses a variable $i \in \bar{S}$, i.e., a fixed value-based condition, we use the classic rule of the RF: if $x_i \leq t$, the observations go to the left child, otherwise to the right child. However, if a split uses a variable $i \in S$, i.e., a regional-based condition, we use the rectangle $C_S(\boldsymbol{x}) = \prod_{i \in S} [a_i, b_i]$: the observations are sent to the left child if $b_i \leq t$, to the right child if $a_i > t$, and to both the left and right children if $a_i \leq t < b_i$. We then use the weights of the Regional RF algorithm to estimate the $CRP_S$ as in Eq. \ref{eq:CDP_estimator}: the estimator is $\widehat{CRP}_S(\YSt; \boldsymbol{x}, C_S(\boldsymbol{x})) = \sum_{i=1}^{n} w_{n, i}(\x) \mathds{1}_{Y_i \in \YSt}$. A more detailed version of the algorithm is provided and discussed in the Appendix.
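
The routing rule described above can be summarized compactly as follows. This is only a restatement: $\text{route}(i,t)$ is a name introduced here for the set of children visited at a split on variable $i$ with threshold $t$, and boundary cases follow the convention that the left child collects the values $\leq t$:
\[
\text{route}(i,t) =
\begin{cases}
\{\text{left}\} & \text{if } i \in \bar{S} \text{ and } x_i \leq t, \text{ or } i \in S \text{ and } b_i \leq t,\\
\{\text{right}\} & \text{if } i \in \bar{S} \text{ and } x_i > t, \text{ or } i \in S \text{ and } a_i > t,\\
\{\text{left}, \text{right}\} & \text{if } i \in S \text{ and } a_i \leq t < b_i .
\end{cases}
\]
For instance, with a hypothetical interval $[a_i, b_i] = [40, 50]$ on a variable $i \in S$, a split at $t = 60$ leads to the left child only, a split at $t = 30$ to the right child only, and a split at $t = 45$ to both children.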


To estimate the $CDP$ of a rule, $CDP_{S}\left(\YSt; \boldsymbol{R}\right)=P\left(f(\X) \in \YSt \left|\boldsymbol{X}_{\bar{S}}\in \boldsymbol{R}_{\bar{S}} \right.\right)$, we simply apply the Projected Forest algorithm to the Regional RF: when a split involving a variable outside of $\bar{S}$ is met, the observations are sent to both the left and right children nodes; otherwise, we use the Regional RF split rule, i.e., if the interval of $\boldsymbol{R}_{\bar{S}}$ for the split variable lies entirely below the threshold $t$, the observations go to the left child; if it lies entirely above $t$, they go to the right child; and if $t$ falls inside the interval, they go to both children. The estimator of the $CRP_S(\YSt; \boldsymbol{R}, C_S(\boldsymbol{R}))$ for a rule is also derived from the Regional RF. Indeed, it is a special case of the Regional RF algorithm where there are only regional-based conditions.


\section{Learning the Counterfactual Rules}
We compute the Local and Regional CR using the estimators of the previous section. First, we find the Minimal Divergent Explanation in the same way as the Minimal Sufficient Explanation is found in \citep{amoukou2021consistent}. As the number of possible subsets grows exponentially with the number of features, we search for the Minimal Divergent Explanation among the $K=10$ most frequently selected variables in the RF $m_{k,n}$ used to estimate the probabilities $CDP_S, CRP_S$ ($K$ is a hyperparameter to be chosen according to the use case and the available computational power). Any other importance measure can also be used to select these variables.  

 Given an instance $\boldsymbol{x}$ or a rectangle $\boldsymbol{R}$ (and a set $\YSt$) and the corresponding Minimal Divergent Explanation $S$, we want to find a rule $C_S(\boldsymbol{x}) = \prod_{i \in S} [a_i, b_i]$ s.t., given $\boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$ or $\boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}}$, and $\XS \in C_S(\boldsymbol{x})$, the probability that $f(\X) \in \YSt$ is high. More formally, we want $P(f(\X) \in \YSt | \XS \in C_S(\boldsymbol{x}), \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}})$ or $P(f(\X) \in \YSt| \XS \in C_S(\boldsymbol{x}), \boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}} )$ to be above $\pi_C$.
 
The computation of the rectangles $C_S(\boldsymbol{x}) = \prod_{i\in S} [a_i, b_i]$ relies heavily on our use of the RF and on the algorithmic trick of the Projected RF. Indeed, the rectangles defining the rules arise naturally from the RF, whereas AReS \citep{rawal2020beyond} relies on binned variables to generate candidate rules and tests all these possible rules to choose an optimal one. We thus avoid both the computational burden and the challenge of choosing the number of bins.

\begin{figure}[!htb]
\minipage{0.30\textwidth}
  \includegraphics[width=\linewidth]{figures/tree_partition.png}
  \caption{The partition of the RF learned to classify the toy data (Green/Blue stars). It has 10 leaves. The explainee $\boldsymbol{x}$ is the Blue triangle in leaf 5. }\label{fig:forest_part}
\endminipage\hfill
\minipage{0.30\textwidth}
  \includegraphics[width=\linewidth]{figures/projected_partition.png}
  \caption{The partition of the projected Forest when we condition on $X_0$, i.e., ignoring the splits based on $X_1$ (the dashed lines).}\label{fig:projected_part}
\endminipage\hfill
\minipage{0.30\textwidth}%
  \includegraphics[width=\linewidth]{figures/counterfactual_rule.png}
  \caption{The optimal CR for $\boldsymbol{x}$ when we condition on $X_0=x_0$ is the Green region; it corresponds to the union of leaves 3 and 4 of the forest.}\label{fig:cr}
\endminipage
\end{figure}

 
To illustrate the idea, we use a two-dimensional dataset $(X_0, X_1)$ with label $Y$, represented as Green/Blue stars in Figure \ref{fig:forest_part}. We fit a Random Forest to classify this dataset and show its partition in Figure \ref{fig:forest_part}. The explainee $\boldsymbol{x}$ is the Blue triangle observation.
 
By looking at the different cells/leaves of the RF, we can guess that the Minimal Divergent Explanation of $\boldsymbol{x}$ is $S = X_1$. Indeed, in figure \ref{fig:projected_part}, we observe the leaves of the Projected Forest when we do not condition on $S = X_1$, i.e., when the RF's partition is projected onto the subspace $X_0$ only. This consists in ignoring all the splits in the other directions (here the $X_1$-axis). Thus, $\boldsymbol{x}$ falls in the projected leaf 2 (see figure \ref{fig:projected_part}) and its $CDP$ is  $CDP_{X_1}(\text{Green}; \boldsymbol{x})=\frac{10 \text{ Green}}{10\text{ Green} + 17\text{ Blue}} = 0.58$.
 
Finally, the problem of finding the optimal rectangle $C_S(\boldsymbol{x}) = [a_i, b_i]$ in the direction of $X_1$ s.t. the decision changes can easily be solved by using the leaves of the RF. We consider the observations that belong to the projected RF leaf 2 (figure \ref{fig:projected_part}), where $\boldsymbol{x}$ falls, and look at the leaves of the original RF (figure \ref{fig:forest_part}) that contain them. We see in figure \ref{fig:cr} that the optimal rectangle to change the decision, given $X_0 = x_0$ or given that the instance lies in the projected RF leaf 2, is the union of the intervals on $X_1$ of leaves 3 and 4 of the RF (see the Green region of figure \ref{fig:cr}).
 
Given an instance $\boldsymbol{x}$ and its Minimal Divergent Explanation $S$, the first step is to collect the observations that belong to the leaf of the Projected Forest given $\bar{S}$ where $\boldsymbol{x}$ falls. These are the observations that have positive weights in the computation of $CDP_S(\YSt; \boldsymbol{x}) = \sum_{i=1}^{n} w_{n, i}(\boldsymbol{x}_{\bar{S}}) \mathds{1}_{Y_i \in \YSt}$, i.e., $\{\x_i: w_{n, i}(\boldsymbol{x}_{\bar{S}}) >0\}$. Then, we use the partition of the original forest to find the possible leaves $C_S(\boldsymbol{x})$ in the direction $S$. The possible leaves are among the RF's leaves of the collected observations $\{\boldsymbol{x}_i: w_{n,i}(\boldsymbol{x}_{\bar{S}}) >0\}$. Let $L(\boldsymbol{x}_i)$ denote the leaf of an observation $\x_i$ with $w_{n, i}(\boldsymbol{x}_{\bar{S}}) >0$. A possible leaf is a leaf $L(\boldsymbol{x}_i)$ s.t. $CRP_S(\YSt, \boldsymbol{x}, L(\boldsymbol{x}_i)_S) = P( f(\X) \in \YSt | \XS \in L(\x_i)_S, \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}) \geq \pi_C$. Finally, we merge all the neighboring possible leaves to get the largest rectangle, and this maximal rectangle is the counterfactual rule. Note that the union of the possible leaves is not necessarily connected, so we can obtain multiple counterfactual rules.
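
The merging step can be illustrated in the simplest setting $|S| = 1$, where each possible leaf restricted to $S$ is an interval. The following is only a minimal sketch under that assumption and not our implementation: the function name \texttt{merge\_possible\_leaves}, the tolerance \texttt{tol} and the example intervals are hypothetical. With $|S| > 1$, merging hyperrectangles into a maximal rectangle requires more care.

\begin{lstlisting}[language=Python]
def merge_possible_leaves(intervals, tol=1e-9):
    """Merge touching or overlapping 1-D intervals (a, b) into maximal ones."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1] + tol:
            # the interval touches the previous one: extend it
            merged[-1][1] = max(merged[-1][1], b)
        else:
            # gap with the previous interval: start a new rule
            merged.append([a, b])
    return [tuple(iv) for iv in merged]


# Hypothetical X_1-intervals of the leaves whose CRP is at least pi_C
possible = [(0.2, 0.5), (0.5, 0.9), (1.4, 2.0)]
rules = merge_possible_leaves(possible)
print(rules)
\end{lstlisting}

In this toy example the first two intervals share a boundary and are merged, while the third one is disconnected from them and yields a second counterfactual rule.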

We apply the same idea to find the regional CR. Given a rule $\boldsymbol{R}$ and its Minimal Divergent Explanation $S$, we use the projection given $\boldsymbol{X}_{\bar{S}} \in \boldsymbol{R}_{\bar{S}}$ to find the compatible observations and their leaves, and we combine the possible ones to obtain the regional CR that has $CRP_S(\YSt, \boldsymbol{R}, C_S(\boldsymbol{R})) \geq \pi_C$. For example, consider leaf 5 of the original forest as a rule: \texttt{If $\boldsymbol{X} \in $ Leaf 5, then predict Blue}. Its Minimal Divergent Explanation is also $S=X_1$, and the R-CR is again the Green region in figure \ref{fig:cr}. Indeed, if we satisfy the $X_0$ condition of leaf 5 and the $X_1$ condition of leaves 3 and 4, then the decision changes to Green.







\section{Experiments}
To demonstrate the performance of our framework, we conduct two experiments on real-world datasets. The first shows how the \textit{Local Counterfactual Rules} can be used to explain a regression model. In the second experiment, we compare our approaches with two baseline methods on classification problems: (1) \textbf{CET} \citep{cet4}, which partitions the input space using a decision tree and associates a perturbation vector with each leaf, and (2) \textbf{AReS} \citep{rawal2020beyond}, which performs an exhaustive search to find global counterfactual rules; we use the implementation of \cite{cet4}, which adapts the algorithm to return counterfactual samples instead of rules. We compare the methods only on classification problems, as most prior works do not deal with regression. In all experiments, we split the dataset into train ($75\%$) and test ($25\%$) sets, and we train the model $f$ to be explained, a LightGBM \textit{(estimators=50, nb leaves=8)}, on the train set. We then learn $f$'s predictions on the train set with an approximating RF $m_{nb,n}$ \textit{(estimators=20, max depth=10)}, which is used to generate the CR with $\pi=0.9$. The parameters used for \textbf{AReS} and \textbf{CET} are \textit{max rules=8, bins=10} and \textit{max iterations=1000, max leaf=8, bins=10} respectively. Due to the page limit, the detailed parameters of each method are provided in the Appendix.

\paragraph{Sampling CE using the Counterfactual Rules:} Our approaches cannot be directly compared with the baseline methods, since they all return counterfactual samples while we give rules (ranges of feature values) that change the decision with high probability. However, we adapt the CR to also generate counterfactual samples using a generative model. For example, given an instance $\x = (\xs, \x_{\bar{S}})$, a target $\YSt$ and its counterfactual rule $C_S(\x; \YSt)$, we want to find a sample $\x^\star = (\boldsymbol{z}_S, \x_{\bar{S}})$ with $\boldsymbol{z}_S \in C_S(\x; \YSt)$ s.t. $\x^\star$ is an in-distribution sample and $f(\x^\star) \in \YSt$.
Instead of using a complex conditional generative model such as those of \citep{modeling_td, sdv}, which can be difficult to calibrate, we use an energy-based generative approach \citep{ebmduvenaud, yanebm}. The core idea is to find $\boldsymbol{z}_S \in C_S(\x; \YSt)$ s.t. $\x^\star$ maximizes a given energy score, which ensures that it is an in-distribution sample. As an example of an energy function, we use the negative outlier score of an Isolation Forest \citep{liu2008isolation}. We use Simulated Annealing (see \citep{review_simulated_annealing} for a review) to maximize the negative outlier score using the information of the counterfactual rule $C_S(\x; \YSt)$. In fact, the ranges given by the CR $C_S(\x; \YSt)$ drastically reduce the search space for $\boldsymbol{z}_S$. We use the training set $\mathcal{D}_n$ to find the possible values: we define $P_i$ as the list of values of the variable $i \in S$ found in $\mathcal{D}_n$, and $P_S = \{ \boldsymbol{z}_S = (z_1, \dots, z_{|S|}): \boldsymbol{z}_S \in C_S(\x; \YSt), z_i \in P_i\}$ as the set of possible values of $\boldsymbol{z}_S$. Then, we sample $\boldsymbol{z}_S$ in the set $P_S$ and use Simulated Annealing to find an $\x^\star$ that maximizes the negative outlier score. Note that the algorithm is the same for sampling CE with the Regional-CR. A more detailed version of the algorithm is provided in the Appendix.


Finally, we compare the methods on unseen observations using three criteria. \textit{Correctness} is the fraction of instances for which acting as prescribed changes the prediction to the desired one. \textit{Plausibility} is the fraction of counterfactual samples that are inliers (as predicted by an Isolation Forest). \textit{Sparsity} is the average number of features that have been changed. In addition, for the global counterfactual methods (AReS, Regional-CR), which are not guaranteed to cover all the instances, we compute \textit{Coverage}, the fraction of unseen observations that are covered.
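
As an illustration, the three criteria can be computed as in the following minimal sketch. It is not the evaluation code used for the experiments: the function name \texttt{evaluate\_counterfactuals}, its arguments and the toy arrays are hypothetical, and we assume that the predictions on the counterfactual samples and the inlier flags have already been computed.

\begin{lstlisting}[language=Python]
import numpy as np


def evaluate_counterfactuals(x, x_cf, pred_cf, target, inlier):
    """
    x, x_cf : arrays of shape (n, d), instances and their counterfactuals
    pred_cf : array of shape (n,), model predictions on x_cf
    target  : array of shape (n,), desired predictions
    inlier  : boolean array of shape (n,), True if x_cf passes the outlier test
    """
    # fraction of counterfactuals that reach the desired prediction
    correctness = float(np.mean(pred_cf == target))
    # fraction of counterfactuals flagged as inliers
    plausibility = float(np.mean(inlier))
    # average number of features that differ from the original instance
    sparsity = float(np.mean(np.sum(~np.isclose(x, x_cf), axis=1)))
    return correctness, plausibility, sparsity


# Toy data (hypothetical): two instances with three features
x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
x_cf = np.array([[1.0, 0.0, 3.0], [0.0, 5.0, 0.0]])
scores = evaluate_counterfactuals(
    x, x_cf,
    pred_cf=np.array([1, 0]),
    target=np.array([1, 1]),
    inlier=np.array([True, True]),
)
print(scores)
\end{lstlisting}

Here one feature is changed for the first instance and two for the second, so the sparsity is $1.5$, while only the first counterfactual reaches its target.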

\paragraph{Local counterfactual rules for regression:} We give recourse for the \textbf{California House Price} dataset \citep{california_data} derived from the 1990 U.S. census. We have information about each district (demography, \dots), and the goal is to predict the median house value of each district.

To illustrate the efficiency of the Local-CR, we select all the observations in the test set having a price lower than $100k$ (1566 houses), and we aim to find the recourse that increases their price: we want the price $y$ to be in the interval $\YSt=[200k, 250k]$. For each instance $\x$, we compute the Minimal Divergent Explanation $S$, the Local-CR $C_S(\x; [200k, 250k])$ and a CE using Simulated Annealing as described above. We succeed in changing the decision of all the observations $(\textit{Correctness}=1)$, and most of the counterfactuals pass the outlier test ($\textit{Plausibility}=0.92$). On top of that, our Local-CR have sparse support ($\textit{Sparsity}=4.45$). For example, the Local-CR of the instance 
$\x =$ \texttt{(longitude=-118.2, latitude=33.8, housing median age=26, total rooms=703, 
total bedrooms=202, population=757, households=212, median income=2.52)} is $C_S(\x, [200k, 250k]) =$\texttt{ (total rooms $\in [2132, 3546],$ total bedrooms $\in [214, 491]$)}. It means that if \texttt{total rooms and total bedrooms} satisfy the conditions in $C_S(\x, [200k, 250k])$ and the remaining features of $\x$ are fixed, then the probability that the price is in $[200k, 250k]$ is 0.97. 



\paragraph{Comparisons of Local-CR and Regional-CR with baselines (AReS, CET):} We use 3 real-world datasets: \textbf{Diabetes} \citep{diabetes} contains diagnostic measurements and aims to predict whether or not a patient has diabetes; \textbf{Breast Cancer Wisconsin (BCW)} \citep{UCI} consists of predicting whether a tumor is benign or not using the characteristics of the cell nuclei; and \textbf{Compas} \citep{compasdata} is used to predict recidivism and contains information about criminal history and demographic attributes. During the evaluation, we observe that \textbf{AReS} and \textbf{CET} are very sensitive to the number of bins and to the maximal number of rules or actions, as also noted in \citep{globalce}. A bad parameterization gives completely useless explanations. Moreover, for these baselines a different model needs to be trained for each class to be accurate, while we only need an RF that has good precision.

In table \ref{tab:results}, we notice that the Local and Regional-CR succeed in changing decisions with high accuracy on all datasets, outperforming \textbf{AReS} and \textbf{CET} by a large margin on \textbf{BCW} and \textbf{Diabetes}. Moreover, we notice that the baselines struggle to change the positive and the negative class at the same time (e.g., CET has \textit{Acc}=1 on the positive class and 0.21 on the negative class of \textbf{BCW}), or, when they have a good \textit{Acc}, the CE are not plausible. For instance, CET has \textit{Acc}=0.98 and \textit{Psb}=0 on the negative class of \textbf{Compas}, meaning that all the CE are outliers. Regarding the coverage of the global CE, CET covers all the instances as it partitions the space, but we observe that \textbf{AReS} has a smaller \textit{Coverage}$=\{0.43, 0.44, 0.81\}$ than the Regional-CR, which has $\{1, 0.7, 1\}$ for \textbf{BCW, Diabetes, and Compas} respectively. To sum up, the CR are easier to train and provide more accurate and plausible rules than the baseline methods.


\begin{table}[ht!]
\caption{Results of the \textit{Correctness} (Acc), \textit{Plausibility} (Psb), and \textit{Sparsity} (Sps) of the different methods. We compute each metric separately for the positive (Pos) and negative (Neg) class.}
\label{tab:results}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{ccccccccccccccccccc}
\cline{2-19}
 &
  \multicolumn{6}{c}{\textbf{COMPAS}} &
  \multicolumn{6}{c}{\textbf{BCW}} &
  \multicolumn{6}{c}{\textbf{Diabetes}} \\ \cline{2-19} 
 &
  \multicolumn{2}{c}{Acc} &
  \multicolumn{2}{c}{Psb} &
  \multicolumn{2}{c|}{Sps} &
  \multicolumn{2}{c}{Acc} &
  \multicolumn{2}{c}{Psb} &
  \multicolumn{2}{c|}{Sps} &
  \multicolumn{2}{c}{Acc} &
  \multicolumn{2}{c}{Psb} &
  \multicolumn{2}{c}{Sps} \\ \cline{2-19} 
 &
  Pos &
  Neg &
  Pos &
  Neg &
  Pos &
  \multicolumn{1}{c|}{Neg} &
  Pos &
  Neg &
  Pos &
  Neg &
  Pos &
  \multicolumn{1}{c|}{Neg} &
  Pos &
  Neg &
  Pos &
  Neg &
  Pos &
  Neg \\
\textbf{L-CR} &
  1 &
  0.9 &
  0.87 &
  0.73 &
  2 &
  \multicolumn{1}{c|}{4} &
  1 &
  1 &
  0.96 &
  1 &
  9 &
  \multicolumn{1}{c|}{7} &
  0.97 &
  1 &
  0.99 &
  0.8 &
  3 &
  4 \\
\textbf{R-CR} &
  0.9 &
  0.98 &
  0.74 &
  0.93 &
  2 &
  \multicolumn{1}{c|}{3} &
  0.89 &
  0.9 &
  0.94 &
  0.93 &
  9 &
  \multicolumn{1}{c|}{9} &
  0.99 &
  0.99 &
  0.9 &
  0.87 &
  3 &
  4 \\
\textbf{AReS} &
  0.98 &
  1 &
  0.8 &
  0.61 &
  1 &
  \multicolumn{1}{c|}{1} &
  0.63 &
  0.34 &
  0.83 &
  0.80 &
  4 &
  \multicolumn{1}{c|}{3} &
  0.73 &
  0.60 &
  0.77 &
  0.86 &
  1 &
  1 \\
\textbf{CET} &
  0.85 &
  0.98 &
  0.7 &
  0 &
  2 &
  \multicolumn{1}{c|}{2} &
  1 &
  0.21 &
  0.6 &
  0.80 &
  8 &
  \multicolumn{1}{c|}{2} &
  0.84 &
  1 &
  0.60 &
  0.20 &
  6 &
  6
\end{tabular}%
}
\end{table}

\section{Conclusion}
Most current methods generate CE implicitly, through an optimization process or a bunch of random samples, and thus lack guarantees. For this reason, we rethink CE as \textit{Counterfactual Rules}. For any individual or sub-population, they give the simplest policies that change the decision with high probability. Our approach learns robust, plausible, and sparse adversarial regions where the observations should be moved. We make central use of Random Forests, which give consistent estimates of the probabilities of interest and naturally provide the counterfactual rules we want to extract. In addition, this permits us to deal with regression problems and continuous features. Consequently, our methods are suitable for all datasets where tree-based models perform well (e.g., tabular data). Future work includes evaluating the robustness of our methods to noisy human responses, i.e., when the prescribed recourse is not implemented exactly, and refining the methodology for selecting the threshold probabilities $\pi$ and $\pi_C$.






\newpage


\section{Regional RF detailed}
In this section, we give a simple application of the Regional RF algorithm to better understand how it works. Recall that the Regional RF is a generalization of the RF algorithm that gives predictions even when we condition on a region, e.g., to estimate $E(f(\X) \; | \boldsymbol{X}_S \in C_S(\boldsymbol{x}), \boldsymbol{X}_{\bar{S}} = \boldsymbol{x}_{\bar{S}})$ with $C_{S}(\boldsymbol{x}) = \prod_{i=1}^{|S|} [a_i, b_i], a_i, b_i \in \bar{\mathbb{R}}$ a hyperrectangle. The algorithm works as follows: we drop the observations down the initial trees. If a split uses a variable $i \in \bar{S}$ (a fixed value-based condition), we use the classic rule, i.e., if $x_i \leq t$, the observations go to the left child, otherwise to the right child. However, if a split uses a variable $i \in S$ (a regional-based condition), we use the hyperrectangle $C_S(\boldsymbol{x}) = \prod_{i=1}^{|S|} [a_i, b_i]$. The observations are sent to the left child if $b_i \leq t$, to the right child if $a_i > t$, and to both the left and right children if $t \in [a_i, b_i]$. 
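
The traversal rule above can be sketched on a single tree as follows. This is a minimal illustration under stated assumptions rather than our implementation: the nested-dictionary tree format, the function names and the toy tree (which is not the tree of figure \ref{fig:tree_example}) are hypothetical, and we choose to aggregate by pooling the training observations of all compatible leaves.

\begin{lstlisting}[language=Python]
def compatible_leaves(node, x, S, C_S):
    """Leaves reachable when X_i = x[i] for i not in S and X_i in C_S[i] for i in S."""
    if "value" in node:  # terminal leaf
        return [node]
    i, t = node["feature"], node["threshold"]
    if i in S:
        a, b = C_S[i]
        if b <= t:  # the whole interval lies on the left of the split
            return compatible_leaves(node["left"], x, S, C_S)
        if a > t:  # the whole interval lies on the right of the split
            return compatible_leaves(node["right"], x, S, C_S)
        # the split cuts the interval: follow both children
        return (compatible_leaves(node["left"], x, S, C_S)
                + compatible_leaves(node["right"], x, S, C_S))
    child = "left" if x[i] <= t else "right"
    return compatible_leaves(node[child], x, S, C_S)


def regional_prediction(tree, x, S, C_S):
    """Mean output of the training observations pooled over the compatible leaves."""
    leaves = compatible_leaves(tree, x, S, C_S)
    total = sum(leaf["value"] * leaf["n"] for leaf in leaves)
    return total / sum(leaf["n"] for leaf in leaves)


# Hypothetical toy tree: "value" is the leaf output, "n" its number of observations
tree = {
    "feature": 0, "threshold": 2.0,
    "left": {
        "feature": 1, "threshold": 2.9,
        "left": {"value": 1.0, "n": 4},
        "right": {"value": 3.0, "n": 6},
    },
    "right": {"value": 5.0, "n": 10},
}
pred = regional_prediction(tree, x=[1.5, 1.9], S={1}, C_S={1: (2.0, 3.5)})
print(pred)
\end{lstlisting}

In this toy example the split on $X_1$ at $2.9$ cuts the query interval $[2, 3.5]$, so both leaves below it are compatible and their observations are pooled.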

To illustrate how it works, we use a two-dimensional variable $\X \in \mathbb{R}^2$ and a simple decision tree $f$ represented in figure \ref{fig:tree_example}, and we want to compute, for $\x = [1.5, 1.9]$, $E(f(\X) | \boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$. We assume that $P(X_1 \in [2, \; 3.5] \; | X_0 = 1.5) >0$ and denote by $T_1$ the set of the split values of the decision tree that are based on the variable $X_1$. One way of estimating this conditional mean is by Monte Carlo sampling. There are then two cases: 

\begin{figure}[ht!]
   
    \centering
    \includegraphics[scale=0.5]{figures/illustration_neurips.png}
    \caption{Representation of a simple decision tree (right figure) and its associated partition (left figure). The gray part in the partition corresponds to the region $[2, \; 3.5] \times [1, 2]$}
    \label{fig:tree_example}
\end{figure}
\begin{itemize}
   
   
    \item If $\forall t \in T_1,$ $t \leq 2$ or $t > 3.5$, then all the observations sampled s.t. $\Tilde{X}_i \sim \mathcal{L} (\X \; |\boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$ follow the same path and fall in the same leaf. The Monte Carlo estimator of $E(f(\X) | \boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$ for the decision tree is then equal to the output of the Regional RF algorithm. 
    \begin{itemize}
        \item For instance, a special case of the case above is: if $\forall t \in T_1, t \leq 2$, and we sample using $\mathcal{L} (\X \; |\boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$, then all the observations go to the right child whenever they encounter a node using $X_1$, and they fall in the same leaf. 
    \end{itemize}
    \item If $\exists \; t \in T_1$ s.t. $t \in [2, \; 3.5]$, then the observations sampled s.t. $\Tilde{X}_i \sim \mathcal{L} (\X \; |\boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$ can fall in multiple terminal leaves, depending on whether their coordinate $x_1$ is lower than $t$. Following our example, if we generate samples using $\mathcal{L} (\X \; |\boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5)$, the observations will fall in the gray region of figure \ref{fig:tree_example}, and thus can fall in node 4 or 5. Therefore, the exact value is: 
    \begin{align}
         & E(f(\X) | \boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5 ) \nonumber\\
         & = p(X_1 \leq 2.9\; | X_0=1.5) \cdot E[f(\X)\;| \X \in L_4] + p(X_1 > 2.9\; | X_0=1.5) \cdot E[f(\X)\; |\X \in L_5] \label{fig:weighted_mean}
    \end{align}
\end{itemize}


Concerning the last case $(t \in [2, \; 3.5])$, we need to estimate the probabilities $p(X_1 \leq 2.9\; | X_0=1.5), p(X_1 > 2.9\; | X_0=1.5)$ to compute $E(f(\X) | \boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5 )$, but these probabilities are difficult to estimate in practice. However, we argue that we can ignore these splits, and thus do not need to fragment the query region using the leaves of the tree. Indeed, as we are no longer interested in a point estimate but in a regional one (a population mean), we do not need to go down to the level of the leaves. We propose to ignore the splits that divide the query region. For instance, the leaves 4 and 5 split the region $[2, \; 3.5]$ into two cells; by ignoring this split, we estimate the mean of the gray region by taking the average output of the leaves 4 and 5 instead of computing the mean weighted by the probabilities as in Eq. \ref{fig:weighted_mean}. Roughly, this consists in following the classic rules of a decision tree when the region is entirely above or below a split, and in ignoring the splits that fall inside the query region, i.e., we average the output of all the leaves that are compatible with the condition $\boldsymbol{X}_1 \in [2, \; 3.5], \boldsymbol{X}_{0} = 1.5$. 
We think that this leads to a better approximation for two reasons. First, we observe that the case where $t$ lies in the query region, and thus divides it, does not happen often. Second, the leaves of the trees are very small in practice, and taking the mean of the observations that fall in the union of the leaves that belong to the query region is more reasonable than computing the weighted mean, which requires estimating the probabilities $p(X_1 \leq 2.9\; | X_0=1.5), p(X_1 > 2.9\; | X_0=1.5)$.

\section{Additional experiments}
In table \ref{tab:add_exp}, we compare the \textit{Correctness} (Acc), \textit{Plausibility} (Psb), and \textit{Sparsity} (Sps) of the different methods on additional real-world datasets: FICO \citep{helocdata} and NHANESI \citep{nhanes}. 

We observe that the L-CR and R-CR outperform the baseline methods by a large margin on \textit{Correctness} and \textit{Plausibility}. The baseline methods still struggle to change the positive and the negative class at the same time. In addition, AReS and CET give sparser counterfactuals, but their counterfactual samples are less plausible than the ones generated by the CR.


\begin{table}[ht!]
\caption{Results of the \textit{Correctness} (Acc), \textit{Plausibility} (Psb), and \textit{Sparsity} (Sps) of the different methods on the additional datasets. We compute each metric separately for the positive (Pos) and negative (Neg) class.}
\label{tab:add_exp}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{ccccccccccccc}
\cline{2-13}
              & \multicolumn{6}{c}{\textbf{FICO}}                           & \multicolumn{6}{c}{\textbf{NHANESI}}  \\ \cline{2-13} 
 & \multicolumn{2}{c}{Acc} & \multicolumn{2}{c}{Psb} & \multicolumn{2}{c|}{Sps} & \multicolumn{2}{c}{Acc} & \multicolumn{2}{c}{Psb} & \multicolumn{2}{c}{Sps} \\ \cline{2-13} 
              & Pos  & Neg  & Pos  & Neg  & Pos & \multicolumn{1}{c|}{Neg}  & Pos  & Neg  & Pos  & Neg  & Pos & Neg \\
\textbf{L-CR} & 0.98 & 0.94 & 0.98 & 0.99 & 5   & \multicolumn{1}{c|}{5}    & 0.99 & 0.98 & 0.98 & 0.97 & 5   & 6   \\
\textbf{R-CR} & 0.90 & 0.94 & 0.98 & 0.99 & 9   & \multicolumn{1}{c|}{8.43} & 0.86 & 0.95 & 0.96 & 0.99 & 7   & 7   \\
\textbf{AReS} & 0.34 & 0.01 & 0.85 & 0.86 & 2   & \multicolumn{1}{c|}{1}    & 0.06 & 1    & 0.87 & 0.92 & 1   & 1   \\
\textbf{CET}  & 0.76 & 0    & 0.76 & 0.60 & 2   & \multicolumn{1}{c|}{2}    & 0    & 0.40 & 0.82 & 0.56 & 0   & 5  
\end{tabular}%
}
\end{table}
\section{Simulated annealing to generate counterfactual samples using the Counterfactual Rules}
\begin{lstlisting}[language=Python, caption=The simulated annealing algorithm to generate samples that satisfy the condition CR]
import numpy as np
    
def generate_candidate(x, S, x_train, C_S, n_samples):
    """
    Generate samples by drawing each feature in S marginally among the training values that lie in C_S.
    Args:
        x (numpy.ndarray): 1-D array, an observation
        S (list): contains the indices of the variables on which to condition
        x_train (numpy.ndarray): 2-D array representing the training samples
        C_S (numpy.ndarray): 3-D (#variables x 2 x 1) representing the hyper-rectangle on which to condition
        n_samples (int): number of samples
    Returns:
        The generated samples, a 2-D array of shape (n_samples, #variables)
    """
    # Training values of each feature i in S that fall in [C_S[i, 0], C_S[i, 1]]
    x_poss = [x_train[(C_S[i, 0] <= x_train[:, i]) & (x_train[:, i] <= C_S[i, 1]), i] for i in S]
    x_cand = np.repeat(x.reshape(1, -1), repeats=n_samples, axis=0)

    # Features outside S keep the values of x; features in S are resampled
    for j, i in enumerate(S):
        rdm_id = np.random.randint(low=0, high=x_poss[j].shape[0], size=n_samples)
        x_cand[:, i] = x_poss[j][rdm_id]

    return x_cand


def simulated_annealing(outlier_score, x, S, x_train, C_S, batch, max_iter, temp, max_iter_convergence):
    """
    Generate a sample X s.t. X_S in C_S using simulated annealing and an outlier score.
    Args:
        outlier_score (callable): outlier_score(X) returns an outlier score for each row of X. If the value is negative, then the observation is an outlier.
        x (numpy.ndarray): 1-D array, an observation
        S (list): contains the indices of the variables on which to condition
        x_train (numpy.ndarray): 2-D array representing the training samples
        C_S (numpy.ndarray): 3-D (#variables x 2 x 1) representing the hyper-rectangle on which to condition
        batch (int): number of samples per iteration
        max_iter (int): number of iterations of the algorithm
        temp (float): the temperature of the simulated annealing algorithm
        max_iter_convergence (int): number of iterations without improvement after which the algorithm stops once it has found an in-distribution observation

    Returns:
        The generated sample, and its outlier score
    """

    best = generate_candidate(x, S, x_train, C_S, n_samples=1)[0]
    best_eval = outlier_score(best.reshape(1, -1))[0]
    curr, curr_eval = best, best_eval

    it = 0
    for i in range(max_iter):
        # Draw a batch of candidates and keep the one with the highest score
        x_cand = generate_candidate(curr, S, x_train, C_S, batch)
        score_candidates = outlier_score(x_cand)

        candidate_eval = np.max(score_candidates)
        candidate = x_cand[np.argmax(score_candidates)]

        if candidate_eval > best_eval:
            best, best_eval = candidate, candidate_eval
            it = 0
        else:
            it += 1

        # check convergence
        if best_eval > 0 and it > max_iter_convergence:
            break

        diff = candidate_eval - curr_eval
        # i + 2 avoids dividing by log(1) = 0 at the first iteration
        t = temp / np.log(float(i + 2))
        # The score is maximized: a worse candidate (diff < 0) is accepted with probability exp(diff / t)
        metropolis = np.exp(diff / t)

        if diff > 0 or np.random.rand() < metropolis:
            curr, curr_eval = candidate, candidate_eval

    return best, best_eval
\end{lstlisting}
    




\section{Parameters detailed}
In this section, we give the parameters of each method. For all methods and datasets, we first ran a greedy search over a set of candidate parameters. For AReS, we use the following set of parameters:
\begin{itemize}
    \item max rule = $\{4, 6, 8\}$, max rule length $=\{4, 8 \}$, max change num $= \{2, 4, 6\}$,
    \item minimal support $= 0.05$, discretization bins = $\{ 10, 20\}$,
    \item $\lambda_{acc} = \lambda_{cov} = \lambda_{cst} = 1$.
   
\end{itemize}
For CET, we search in the following set of parameters: 
\begin{itemize}
    \item max iterations $ = \{500, 1000\}$,
    \item max leaf size $= \{ 4, 6, 8, -1\}$,
    \item $\lambda = 0.01, \gamma = 1 $.
\end{itemize}

Finally, for the Counterfactual Rules, we used the following parameters:
\begin{itemize}
    \item nb estimators = $\{20, 50 \}$, max depth= $\{8, 10, 12\}$,
    \item $\pi=0.9$, $\pi_C=0.9$.
\end{itemize}
We obtained the same optimal parameters for all datasets:
\begin{itemize}
    \item AReS:  max rule $= 4$, max rule length$= 4$, max change num $= 4$, minimal support $= 0.05$, discretization bins = $10$, $\lambda_{acc} = \lambda_{cov} = \lambda_{cst} = 1$
    \item CET: max iterations $= 1000$, max leaf size $=-1$, $\lambda = 0.01, \gamma = 1 $
    \item CR: nb estimators$= 20$, max depth$=10$, $\pi=0.9$, $\pi_C=0.9$
\end{itemize}

The code and the results can be found at \url{https://github.com/anoxai/counterfactual_rules}.




\newpage

