%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\edits}[1]{\textcolor{blue}{#1}}

\title{Interpretable Differencing of Machine Learning Models}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<swagatam.haldar@ibm.com>?Subject=Your UAI 2023 paper}{Swagatam~Haldar}{}}
\author[1]{Diptikalyan~Saha}
\author[2]{Dennis~Wei}
\author[3]{Rahul~Nair}
\author[3]{Elizabeth~M.~Daly}
% Add affiliations after the authors
\affil[1]{%
    IBM Research\\
    Bangalore, India
}
\affil[2]{%
    IBM Research\\
    Yorktown Heights, New York, USA\\
%    …
}
\affil[3]{%
    IBM Research\\
    Dublin, Ireland
%    …
  }

% If you use natbib package, activate the following three lines:
% \usepackage[round]{natbib}
% \renewcommand{\bibname}{References}
% \renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{apalike}
\usepackage{xr-hyper}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage[linesnumbered,ruled]{algorithm2e}
\usepackage{balance}

\usepackage{tikz}

\usetikzlibrary{shapes,arrows,positioning} 
\usetikzlibrary{shapes.geometric}
\tikzset{
    %Define standard arrow tip
    >=stealth',
    %Define style for boxes
    punkt/.style={
           rectangle,
           %rounded corners,
           draw=black,
           text width=6.5em,
           minimum height=3em,
           text centered},
     box/.style={
      rectangle, 
      %rounded corners,
      draw=black, 
      minimum width=2em,
      minimum height=2em,
      text centered
  },
    % Define arrow style
    pil/.style={
           ->,
           shorten <=2pt,
           shorten >=2pt,},
    % Define arrow style
    revpil/.style={
           <-,
           shorten <=2pt,
           shorten >=2pt,}
}


\newcommand{\eat}[1]{}
%\newcommand{\argmin}{\arg\!\min} % AlfC
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\imp}{imp}
\DeclareMathOperator*{\lab}{label}
\DeclareMathOperator*{\leaves}{leaves}
\DeclareMathOperator*{\pc}{pc}
\newcommand{\Dtest}{\mathcal{D}_{\mathrm{test}}}
\newcommand{\Dtrain}{\mathcal{D}_{\mathrm{train}}}
\newcommand{\Ttrue}{\mathcal{T}_{\mathrm{true}}}
\newcommand{\Tpred}{\mathcal{T}_{\mathrm{pred}}}

  
\begin{document}
\maketitle

\begin{abstract}
Understanding the differences between machine learning (ML) models is of interest in scenarios ranging from choosing among a set of competing models to updating a deployed model with new training data. In these cases, we wish to go beyond differences in overall metrics such as accuracy to identify \emph{where} in the feature space the differences occur. We formalize this problem of model \emph{differencing} as one of predicting a dissimilarity function of two ML models' outputs, subject to the representation of the differences being human-interpretable. Our solution is to learn a \emph{Joint Surrogate Tree} (JST), which is composed of two conjoined decision tree surrogates for the two models. A JST provides an intuitive representation of differences
and places the changes in the context of the models' decision logic. Context is important as it helps users to map differences to an underlying mental model of an AI system. We also propose a refinement procedure to increase the precision of a JST. 
We demonstrate, through an empirical evaluation, that such contextual differencing is concise and can be achieved with no loss in fidelity over naive approaches.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{INTRODUCTION}
\label{sec: intro}

% Motivation
At various stages of the AI model lifecycle, data scientists make decisions regarding which model to use. For instance, they may choose from a range of pre-built models, select from a list of candidate models generated by automated tools like AutoML, or simply update a model based on new training data to incorporate distributional changes. In these settings, the choice of a model is preceded by an evaluation that typically focuses on accuracy and other metrics rather than on how the model differs from other models.

% Summary statement of problem
We address the problem of model \emph{differencing}. Given two models for the same task and a dataset, we seek to learn where in the feature space the models' predicted outcomes differ. Our objective is to provide accurate and interpretable mechanisms to uncover these differences.



The comparison is helpful in several scenarios. In a \emph{model marketplace}, multiple pre-built models for the same task need to be compared. The models are usually black boxes, possibly trained on different sets of data drawn from the same distribution. During \emph{model selection}, a data scientist trains multiple models and needs to select one for deployment. In this setting, the models are white-box and typically trained on the same training data. For \emph{model change}, where a model is retrained with updated training data with the goal of improving it, the data scientist needs to understand changes in the model beyond accuracy metrics. Finally, \emph{decision pipelines consisting of logic and ML models} occur in business contexts where business logic and the outputs of ML models are combined to produce a final output. Changes might occur due to either model retraining or adjustments in business logic, both of which can impact the behavior of the overall pipeline.

% Summary of our proposal
In this work we address the problem of interpretable model differencing as follows. First, we formulate the problem as one of predicting the values of a {dissimilarity function} of the two models' outputs. We focus herein on $0$-$1$ dissimilarity for two 
%(possibly multi-class) 
classifiers, where $0$ means ``same output'' and $1$ means ``different'', so that prediction quality can be quantified by any binary classification metric such as precision and recall.
Second, we propose a method that learns a \emph{Joint Surrogate Tree} (JST), composed of two conjoined decision tree surrogates to jointly approximate the two models. The root and the branches closest to it are common to both models, while branches farther from the root may be specific to one model. %JSTs thus align the two surrogate models to allow easier comparison, while also avoiding the challenge of directly modelling the differences, which tend to be fragmented. We present a visualization of JSTs as an intuitive representation of model differences. 
A JST thus accomplishes two tasks at once: it provides interpretable surrogates for the two models while also aligning the surrogates for easier comparison and identification of differences. These aspects are encapsulated in a visualization of JSTs that we present.
%The setting we explore is similar to \cite{nair2021changed} which uses rule-based surrogates. Our approach avoids the two limitations of their method around biases in inducing rules and remove the one-to-one correspondence restriction. 
Third, a refinement procedure is used to grow the surrogates in selected regions, improving the {precision} of the dissimilarity prediction. 

% % Blurb about mental models.
Our design of jointly learning surrogates is motivated by the need to place model differences in the context of the overall decision logic. This can aid users who may already %tend to 
have a mental model of (individual) AI systems, either for debugging \citep{kulesza2012tell} or to understand errors \citep{bansal2019beyond}.

% Contribution statement
The main contributions of the paper are (a) a quantitative formulation of the problem of model differencing, and (b) algorithms to learn and refine conjoined decision tree surrogates to approximate two models simultaneously. A detailed evaluation of the method is presented on several benchmark datasets, showing more accurate or more concise representation of model differences, compared to baselines.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{RELATED WORK}

Our work touches upon several active areas of research which we summarize based on key pertinent themes.

\paragraph{Surrogate models and model refinement}
One mechanism to lend interpretability to machine learning models is through surrogates, i.e., simpler human-readable models that mimic a complex model \citep{bucila2006model,ba2014do,hinton2015distilling,lopez-paz2016unifying}. Most relevant to this paper are works that use a decision tree as the surrogate \citep{TREPAN,bastani2017interpretability,frosst2017distilling}. \citet{bastani2017interpretability} showed that interpretable surrogate decision trees extracted from a black-box ML model allowed users to predict the same outcome as the original ML model. %The authors showed that by comparing the decision tree surrogates of two medical providers, differences in how diagnoses were reported could be identified.
\citet{freitas2014comprehensible} also discusses the interpretability and usefulness of decision trees as surrogates. None of these works, however, considers jointly approximating two black-box models.

\paragraph{Decision tree generation with additional objectives}  \citet{chen2019robust} showed that decision tree generation is not robust and slight changes in the root node can result in a very different tree structure. %Work presented in 
\citet{chen2019robust, andriushchenko2019provably} focus on improving robustness when generating the decision tree, while \citet{moshkovitz2021connecting} prioritize both robustness and interpretability. \citet{aghaei2019learning} use mixed-integer optimization to take fairness into account in decision tree generation. However, none of these solutions consider the task of comparing two decision trees. 

%However, when considering the task of comparing two models, the additional challenge of ensuring the tree structure is relatively comparable becomes important. 

\paragraph{Predicting disagreement or shift}
Prior work has focused on identifying statistically whether models have significantly changed \citep{Bu2019_modelchangedetection,Geng2019_ChangeDetection,harel2014}, but not on where they have changed. \citet{cito2021explaining} present a model-agnostic rule-induction algorithm to produce interpretable rules capturing instances that are mispredicted with respect to their ground truth.

\paragraph{Comparing models} The ``distill-and-compare'' approach of \citet{tan2018distill} uses generalized additive models (GAMs) and fits one GAM to a black-box model and a second GAM to ground truth outcomes. While differences between the GAMs are studied to uncover insights, there is only one black-box model. \citet{Demsar2018DetectingCD} study concept drift by determining feature contributions to a model and observing the changes in contributions over time. Similarly, \citet{duckworth2021using} investigated changes in feature importance rankings pre- and post-COVID. These approaches, however, do not localize changes to regions of the feature space. %could be used to identify data drift, however while it can be used to signal drift has occurred it does not provide insights on which populations in the data the predictions have changed and how those predictions have changed.
\citet{chouldechova2017fairer} compare models in terms of fairness metrics and identify groups in the data where two models have maximum disparity.
Prior work by \citet{nair2021changed}, which is most similar to our own, uses rule-based surrogates for two models and derives rules for where the models behave similarly. %The output of 
%Their method is more qualitative as it does not evaluate the accuracy of the rules in predicting model similarities or differences. %mappings. 
%In contrast to our solution, 
Their method biases the learning of the second surrogate based on inputs from the first model, a step they call grounding, and imposes a one-to-one mapping between rules in the two surrogates. This is a strict condition that may not hold in practice. Additionally, their method does not evaluate the accuracy of resulting rules in predicting model similarities or differences. Our approach addresses these limitations.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{PROBLEM STATEMENT AND PRELIMINARIES}
\label{sec:prob}

%\paragraph{Problem statement} 
We are given two predictive models $M_1, M_2: \mathcal{X} \to \mathcal{Y}$ mapping a feature space $\mathcal{X} \subset \mathbb{R}^d$ to an output space $\mathcal{Y}$, as well as a \emph{dissimilarity function} $D: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ (where $\mathbb{R}_+$ denotes the non-negative reals) for comparing the outputs of the two models. Our goal is to obtain a \emph{difference model} (``diff-model'' for short), $\hat{D}: \mathcal{X} \to \mathbb{R}_+$, that predicts the dissimilarity $D(M_1(x), M_2(x))$ well while also being interpretable. To construct $\hat{D}$, we assume access to a dataset $X \in \mathbb{R}^{n\times d}$ consisting of $n$ samples drawn i.i.d.~from a probability distribution $P$ over $\mathcal{X}$. In contrast to supervised learning, this dataset does not need ground truth labels, since supervision is provided by the models $M_1, M_2$. Prediction quality is measured by the expectation $\mathbb{E}[L(\hat{D}(X), D(M_1(X), M_2(X)))]$ of one or more metrics $L: \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ comparing $\hat{D}$ to $D$, where the expectation is with respect to $P$. In practice, these expectations are approximated empirically using a test set.
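As a sketch of this empirical approximation, assume a held-out test set $\{x_1, \ldots, x_m\}$ drawn from $P$ and a metric $L$ that decomposes over samples (the symbols $m$ and $x_j$ are introduced here only for illustration). The expectation is then estimated by the sample average
\begin{equation*}
\frac{1}{m} \sum_{j=1}^{m} L\bigl(\hat{D}(x_j),\, D(M_1(x_j), M_2(x_j))\bigr).
\end{equation*}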

In this work, we focus on classification models $M_1$ and $M_2$, so that $\mathcal{Y}$ is a finite set, and $0$-$1$ dissimilarity $D(M_1(x), M_2(x)) = 1$ if $M_1(x) \neq M_2(x)$ and $D(M_1(x), M_2(x)) = 0$ otherwise. 
% Extension to regression models is discussed in Section~\ref{sec:concl}. 
Accordingly, the predictions $\hat{D}(x)$ are also binary-valued and any binary classification metric $L$ may be used for evaluation. Herein we use precision, recall, and F1-score (described in Section~\ref{sec:expt}). 

%To ensure that the difference model $\hat{D}$ is interpretable, we restrict attention to decision trees as the model class and
We use decision trees as the basis for our Joint Surrogate Tree solution. To ensure interpretability, the height (also referred to as maximum depth) is constrained to a small value (e.g.~$6$ in our experiments). Below we define notation and terminology related to decision trees for later use.


\paragraph{Decision Tree} A decision tree is a binary tree $T=(V_{dt},E_{dt})$ with a node set $V_{dt}$, a root node $r \in V_{dt}$, and a set of directed edges $E_{dt} \subset V_{dt} \times V_{dt}$. Each internal node $v \in V_{dt}$ has a split condition $s(v) := f(v) < t(v)$, a predicate on feature $f(v) \in [d]$ (where $[d]$ is shorthand for $\{1, \ldots, d\}$) with threshold $t(v) \in \mathbb{R}$, and two children $v_T$ and $v_F$. The edges $(v,v_T)$ and $(v,v_F)$ are annotated with edge conditions $f(v) < t(v)$ and $f(v) \ge t(v)$, respectively. Each leaf node $v$ contains a label $\lab(v) \in \mathcal{Y}$. The set of all leaf nodes of a tree rooted at $r$ is denoted by $\leaves(r)$. Given a node $v$, the path condition of $v$ (denoted by $\pc(v)$) is defined as the conjunction of all edge conditions from $r$ to $v$. At a given node $v \in V_{dt}$, we denote by $X_v, y_v$ %and $X_v[f]$ as 
the subset of samples that satisfy $\pc(v)$ and their labels, and we use $X_v[f]$ to denote the set of values of feature $f\in [d]$ in $X_v$. Without loss of generality, $s(v)$ is formed by minimizing an impurity function $H$ over all features and their values. We express the split condition at node $v$ as $s(v) = c(X_v,y_v)$ and the minimum objective value (impurity) by $\imp(X_v,y_v)$:
\begin{align} 
c(X_v,y_v)&=\argmin_{\{f\in[d],\,t\in X_v[f]\}} H(f,t,X_v,y_v)\\
\imp(X_v,y_v)&=\min_{\{f\in[d],\,t\in X_v[f]\}} H(f,t,X_v,y_v) \label{eq:imp}
\end{align}

For example, $H$ can be instantiated as the weighted sum of  entropy values of left and right split~\citep{RQ}. We now describe two baseline approaches to the problem before presenting our proposed algorithm in Section~\ref{sec:algo}.
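To make the entropy-based choice of $H$ mentioned above concrete, one possible instantiation is sketched here. The symbols $X_L$, $X_R$, $y_L$, $y_R$, $p_k$, and $\mathrm{Ent}$ are our own illustrative notation, and the exact weighting is an assumption in the style of standard greedy tree learners. Let $X_L$ be the samples of $X_v$ whose feature $f$ is less than $t$, let $X_R = X_v \setminus X_L$, and let $y_L$, $y_R$ be the corresponding labels. Then
\begin{align*}
H(f,t,X_v,y_v) &= \frac{|X_L|}{|X_v|}\,\mathrm{Ent}(y_L) + \frac{|X_R|}{|X_v|}\,\mathrm{Ent}(y_R),\\
\mathrm{Ent}(y) &= -\sum_{k \in \mathcal{Y}} p_k(y) \log_2 p_k(y),
\end{align*}
where $p_k(y)$ is the fraction of labels in $y$ equal to $k$ (with the convention $0 \log_2 0 = 0$). For instance, a split that sends four samples with labels $(0,0,0,0)$ to the left and four samples with labels $(0,1,1,1)$ to the right has $\mathrm{Ent}(y_L) = 0$ and $\mathrm{Ent}(y_R) \approx 0.811$, giving $H \approx 0.5 \cdot 0 + 0.5 \cdot 0.811 \approx 0.406$.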

\paragraph{Direct difference modelling} Given the above problem statement, a natural %naive 
way to predict the dissimilarity function $D$ is to let $\hat{D}$ be a single ML model, in our case a decision tree for interpretability, and train it to classify between $D=0$ (models $M_1, M_2$ having the same output) and $D=1$ (different output). We call this the \emph{direct} approach. The main drawback of direct differencing is that even when using an interpretable decision tree, it does not capture the differences between the two models in the context of their human-interpretable decision processes, i.e., where in the decision logic of the models the differences occur. 

\paragraph{Surrogate modelling} Another natural %naive 
way to model the dissimilarity is to separately build a decision tree surrogate $\hat{M}_i$ for each input model $M_i$, $i=1,2$, using the outputs of $M_i$ on the input samples $X$ for training the surrogate. Then we predict $\hat{D}(x) = 1$ if $\hat{M}_1(x) \neq \hat{M}_2(x)$ and $\hat{D}(x) = 0$ otherwise. We call this the \emph{separate surrogate} approach. Its drawback is that the two decision tree surrogates are not aligned, making it cumbersome %and un-intuitive 
for human comparison. In Section~\ref{sec:expt}, we show that the manifestation of this drawback is the large number of rules (see next paragraph) needed to describe all the regions where the two surrogates differ. 

\paragraph{Diff rules as output} 
We use \emph{diff rules} as an interpretable representation of model differences for both direct and surrogate tree-based diff models. A diff rule is a conjunction of conditions on individual features that, when satisfied at a point $x$, yields the prediction $\hat{D}(x) = 1$. Corresponding to each diff rule is a \emph{diff region}, the set of $x$'s that satisfy the rule. A \emph{diff ruleset} $\mathcal{R}$ is a set of diff rules such that if $x$ satisfies any rule in the set, we predict $\hat{D}(x) = 1$. For a direct decision tree model, the diff rules are given by the path conditions of the $\hat{D}(x) = 1$ leaves. For surrogate models $\hat{M}_1, \hat{M}_2$, the diff rules are conjunctions of path conditions for pairs of intersecting leaves where $\hat{M}_1(x) \neq \hat{M}_2(x)$.
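As a small illustration for the separate surrogate case (with hypothetical features $x_1, x_2$ and thresholds chosen only for this example), suppose $\hat{M}_1$ is a depth-one tree predicting label $1$ if $x_1 < 5$ and $0$ otherwise, while $\hat{M}_2$ predicts $1$ if $x_2 < 2$ and $0$ otherwise. Intersecting the pairs of leaves with different labels gives the diff ruleset
\begin{equation*}
\mathcal{R} = \{\, x_1 < 5 \wedge x_2 \ge 2,\;\; x_1 \ge 5 \wedge x_2 < 2 \,\},
\end{equation*}
so that, e.g., $x = (3, 4)$ satisfies the first rule and $\hat{D}(x) = 1$, whereas $x = (3, 1)$ satisfies neither rule and $\hat{D}(x) = 0$.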


\begin{figure}[t]
    \centering
    \resizebox{0.7\columnwidth}{!}{
\begin{tikzpicture}[node distance=0.5cm, auto,thick]
  % steps
  \node[punkt, align=center] (jst) {Joint Surrogate \\ Tree Builder};
  \node[punkt, below=of jst] (diff) {Diff Ruleset\\ Builder}
     edge[revpil] (jst);
  \node[draw,diamond,aspect=2,align=center, inner sep=0.1mm, below=of diff](inter){Want more\\ precision?}
    %Still\\ Interpretable?};
     edge[revpil] (diff);
  %\node[punkt, below left=of inter](sample){Sample Generation};
  %\node[draw,diamond, aspect=2, align=center, xshift=-1em, yshift=-3em, inner sep=0.1mm, below left=of inter](fid){Low\\ Fidelity?};
  \node[punkt, below right=of inter,yshift=0.8em](outd){Output\\ Diff Ruleset};
  \node[punkt, left=of diff](ref){JST\\ Refinement};
  \node[box,above=of jst](in_s){Dataset $X$};
  \node[box, left=of in_s](M1){$M_1$};
  \node[box, right=of in_s](M2){$M_2$};
 
  \draw[pil] (in_s) edge (jst)
     (jst) edge (diff)
     (diff) edge (inter);
 
 \draw[pil] (inter.west) -| (ref.south)
     node[pos=0.2,fill=white,inner sep=0.2em, anchor=south]{Yes};
 \draw[pil] (inter.east) -| (outd.north)
     node[pos=0.2,fill=white,inner sep=0.2em, anchor=south]{No};
 
 \draw[pil](M1.south) |- (jst.west);
 \draw[pil](M2.south) |- (jst.east);
 
 %\draw[pil] (sample) to (fid);
 \draw[pil] (ref) to (diff);
 
% \draw[pil] (fid.west) -- node [anchor=south]{Yes} ++(-1,0) |- (ref.west);
 
 %\draw[pil] (fid.east) -| (outd.south)
  %   node[pos=0.1,fill=white,inner sep=0.2em]{No};
 
 \end{tikzpicture}
        }
%\includegraphics[height=3in]{images/method.pdf}
\caption{Method Overview}
\label{fig:method}
\end{figure}

\section{PROPOSED ALGORITHM}
\label{sec:algo}

We propose a technique called IMD, which shows the differences between two ML models by constructing a novel representation called a \emph{Joint Surrogate Tree} or JST. A JST is composed of two conjoined decision tree surrogates that jointly approximate the two models, intuitively capturing similarities and differences between them.
It overcomes the drawbacks of the direct and separate surrogate approaches mentioned in Section~\ref{sec:prob}: it avoids the non-smoothness of direct difference modelling, aligns and promotes similarity between surrogates for the two models, and shows differences in the context of each model's decision logic. Our method has a single hyperparameter, tree depth, which controls the trade-off between accuracy and interpretability. 

IMD performs two steps as shown in Figure~\ref{fig:method}. In the first step, IMD builds a JST for models $M_1, M_2$ using data samples $X$, and then extracts diff regions from the JST. Interpretability is ensured by restricting the height of the JST. The IMD algorithm treats $M_1, M_2$ as black boxes and can handle any pair of classification models. It is also easy to implement as it requires a simple modification to popular greedy decision tree algorithms. 

The second (optional) step, discussed at the end of  Section~\ref{sec:algo:refine}, refines the JST by identifying diff regions where the two decision tree surrogates within the JST differ but the original models do not agree with the surrogates on their predictions. The refinement process aims to increase the fidelity of the surrogates in the diff regions, thereby generating more precise diff regions where the true models also differ.  

\subsection{Joint Surrogate Tree}
\label{section:joint-surrogate-tree}


\paragraph{Representation}

\begin{figure}[t]
    % \centering
\resizebox{\columnwidth}{!}{

\begin{tikzpicture}
   \definecolor{m1}{HTML} {ffb6c1ff};
   \definecolor{m2}{HTML}{ffa07aff};
   \definecolor{l1}{HTML}{e0ffffff};
   \definecolor{l0}{HTML}{faebd7ff};
   \node[anchor=south west,inner sep=0] (jst) at (0,0) {\includegraphics[width=\textwidth]{images/t_simpler.pdf}};
   \begin{scope}[x={(jst.south east)},y={(jst.north west)}]
        \node[ellipse,draw,scale=1.3] (e) at (0.1,0.9) {Common};
        \node[ellipse,draw=black,fill=m1,scale=1.3](m1) at (0.1, 0.8) {$M_1$ - LR};
        \node[ellipse,draw=black,fill=m2,scale=1.3](m2) at (0.1, 0.7) {$M_2$ - RF};
        \node[circle,draw=black,dashed,align=center] at (0.1, 0.55) {Diverging\\ nodes};
        \node[rectangle,draw=black,fill=l1,scale=1.35] at (0.25,0.9){Label 1};
        \node[rectangle,draw=black,fill=l0,scale=1.35] at (0.25,0.8){Label 0};
        
        %\draw[help lines,xstep=.1,ystep=.1] (0,0) grid (1,1);
        %\foreach \x in {0,1,...,9} { \node [anchor=north] at (\x/10,0) {0.\x}; }
        %\foreach \y in {0,1,...,9} { \node [anchor=east] at (0,\y/10) {0.\y}; }
    \end{scope}
\end{tikzpicture}

        }
\caption{A JST for the Breast Cancer (\textit{bc}) dataset.}
\label{fig:jst}
\end{figure}

%Our key model comparison output JST is shown in 
Figure~\ref{fig:jst} shows an example of a JST for Logistic Regression and Random Forest models on the Breast cancer dataset~\citep{UCIRepo} (feature names are omitted to save space). 
%As mentioned, 
The JST consists of two conjoined decision tree surrogates for the two models. The white oval nodes of the JST are shared decision nodes where both surrogates use the same split conditions. We refer to the subtree consisting of white nodes as the common prefix tree.
In contrast, the colored nodes represent separate decision nodes, pink for surrogate $\hat{M}_1$ corresponding to $M_1$, and orange for surrogate $\hat{M}_2$ for $M_2$. The rectangular nodes correspond to the leaves, and are colored differently to represent class labels --- cyan for label 1, and beige for label 0. The leaves are marked as pure/impure depending on whether all the samples falling there have the same label or not.

The JST intuitively captures diff regions, i.e., local regions of feature space where the two input models diverge, and also groups them into a two-level hierarchy. As with any surrogate-based diff model, we have $\hat{D}(x) = 1$ if and only if the constituent decision tree surrogates disagree, $\hat{M}_1(x) \neq \hat{M}_2(x)$. 
Thus, diff regions can be identified by first focusing on an \emph{or-node} (the dashed circular nodes in Figure~\ref{fig:jst}, where the surrogates diverge) and then enumerating pairs of leaves under it with different labels.

For example, considering the rightmost or-node $v_{o1}$ in Figure~\ref{fig:jst}, with path condition $X[22] \ge 116.05$, $\hat{M}_2$ classifies all the samples to label $0$ whereas $\hat{M}_1$ classifies to label $1$ in the region $X[22] < 118.85 \wedge X[29] < 0.1$. Therefore the diff region is $118.85 > X[22] \ge 116.05 \wedge X[29] < 0.1$. 
While in this case $v_{o1}$ yields a single diff region, in general multiple diff regions could be grouped under a single or-node, resulting in a hierarchy. 
By processing all the or-nodes of the JST, one obtains all diff regions.  

Formally, $\mathrm{JST} = (V=V_{dt} \cup V_o, E=E_{dt}\cup E_o)$. $V_{dt}$ is a set of decision nodes as in decision trees (oval-shaped in the figure), with each outgoing edge in $E_{dt}$ (solid arrows) representing a True or False decision as in a regular decision tree. $V_o$ is the set of or-nodes (circular nodes) representing the diverging points where the decision trees no longer share the same split conditions. Each child of $v_o\in V_o$ is denoted as $v_o^{i}$, $i = 1, 2$, with dashed edges $(v_o, v_o^{i}) \in E_o$.  Each $v_o^{i}$ represents the root of an individual surrogate decision sub-tree for model $i$. The height of a JST is the maximum number of decision edges (solid edges) in any root-to-leaf path. 

%The formal definition of the diff regions is as follows. 
Formally, a diff region is defined by the non-empty intersection of path-conditions of differently labelled leaves $l_1, l_2$ from two decision sub-trees rooted at the same or-node $v_o$. The collection of all diff regions specifies the diff ruleset:
%
\begin{align}
\mathcal{R} = &\left\{\pc(l_1)\wedge \pc(l_2) : l_i\in \leaves(v^i_o), \; i=1,2, \right.\nonumber\\
&\left. \quad \lab(l_1)\neq \lab(l_2), \; v_o\in V_o \right\}.\label{eq:diff-region}
\end{align}
%

\paragraph{Construction} 
The objective of JST construction is %therefore 
two-fold:
(a) \emph{Comparability}: achieve maximal sharing of split conditions between the two decision tree surrogates, and
(b) \emph{Interpretability}: achieve the above objective under an interpretability constraint, for which we use the height of the JST.

The construction of a JST corresponding to the inputs $M_1, M_2, X$ starts with evaluating  $y_1=M_1(X)$ and $y_2=M_2(X)$. Starting from the root, at each internal node $v \in V_{dt}$, with inputs $(X_v,y_{1v}=M_1(X_v),y_{2v}=M_2(X_v))$ filtered by the node's path condition, the key choice is whether to create a joint decision node or an or-node for the surrogates to \emph{diverge}. The choice of node type signifies whether the two surrogates will continue to share their split conditions or not. %Note that, 
Once divergence happens at an or-node, the two sub-trees rooted at the or-node %both child decision trees 
do not share any split nodes thereafter. 
Below we present a general condition for divergence and a simplified one implemented in our experiments. 

In general, a divergence condition should compare the cost of a joint split to that of separate splits for the two models. In the context of greedy decision tree algorithms considered in this work, the comparison is between the sum of impurities for the best possible common split,
%In the case of non-divergence, the following objective function returns the common split condition: 
%
%\begin{equation}
%\mathop{\mathrm{argmin}}_{\{f\in[d],\,t\in X_v[f]\}} H(f,t,X_v,y_{1v}) + H(f,t,X_v,y_{2v})
%\end{equation}
\begin{multline}\label{eq:impJoint}
    \imp(X_v, y_{1v}, y_{2v}) =\\ \min_{\{f\in[d],\,t\in X_v[f]\}} H(f,t,X_v,y_{1v}) + H(f,t,X_v,y_{2v}),
\end{multline}
and the impurities $\imp(X_v, y_{1v})$, $\imp(X_v, y_{2v})$ \eqref{eq:imp} for the best separate splits. One condition for divergence is 
\begin{equation}\label{eq:divCond1}
    \imp(X_v, y_{1v}) + \imp(X_v, y_{2v}) \leq \alpha \imp(X_v, y_{1v}, y_{2v})
\end{equation}
for some $\alpha \leq 1$. %$\alpha \in [0, 1]$. 
The choice $\alpha = 1$ always results in divergence and thus reduces to the separate surrogate approach in Section~\ref{sec:prob}. This happens because the left-hand side of \eqref{eq:divCond1} corresponds to separately minimizing the two terms in \eqref{eq:impJoint}, hence ensuring that \eqref{eq:divCond1} is true. As $\alpha$ decreases, joint splits are  favored. %For $\alpha = 0$, divergence occurs only when both impurities are zero, implying that all children of the two splits are pure. %The $imp$ value of zero implies that both the children are pure.
For $\alpha < 0$, divergence essentially never occurs.\footnote{If $\imp(X_v, y_{1v}, y_{2v}) = 0$, then $\imp(X_v, y_{1v}) = \imp(X_v, y_{2v}) = 0$ also and the same $(f, t)$ pair minimizes all three impurities. Hence divergence has no effect.}
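As a numerical illustration of \eqref{eq:divCond1} (the values are hypothetical and chosen only to show the mechanics), suppose that at some node $\imp(X_v, y_{1v}) = 0.2$, $\imp(X_v, y_{2v}) = 0.3$, and $\imp(X_v, y_{1v}, y_{2v}) = 0.6$. With $\alpha = 0.9$, the condition reads $0.5 \leq 0.54$ and holds, so the surrogates diverge at this node. With $\alpha = 0.8$, it reads $0.5 \leq 0.48$ and fails, so a joint split is used instead.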

%The second condition amounts to evaluating the impurity function for all splits corresponding to both $(X_v,y_{1v})$ and $(X_v,y_{2v})$ and  
For this work, we choose to heavily bias the algorithm toward joint splits and greater interpretability of the resulting JST. In this case, we use the simplified condition 
\begin{equation}\label{eq:divCond2}
\imp(X_v,y_{1v}) = 0  \vee \imp(X_v,y_{2v}) = 0,
\end{equation}
%\vspace{0.5cm}
which results in divergence if at least one of the minimum impurity values is zero. The advantage of \eqref{eq:divCond2} over \eqref{eq:divCond1} is that the minimization in \eqref{eq:impJoint} to compute $\imp(X_v, y_{1v}, y_{2v})$ can be done lazily, only if \eqref{eq:divCond2} is not satisfied. If condition \eqref{eq:divCond2} is met, we create an or-node, two or-edges, and grow individual surrogate trees from that point onward. Figure~\ref{fig:jst} shows one instance (node $v_{o0}$) where \eqref{eq:divCond2} is met. A special case of \eqref{eq:divCond2} occurs
when at least one of $y_{1v}, y_{2v}$ contains only one label, i.e., it is already pure without splitting. The node $v_{o1}$ in Figure~\ref{fig:jst} shows one such case. 
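
Assuming that the stopping criteria have not yet been reached, the decision made at a node $v$ under \eqref{eq:divCond2} can be summarized as follows, where $\mathrm{action}(v)$ is shorthand introduced here only for illustration:
\[
\mathrm{action}(v) =
\begin{cases}
\text{diverge (or-node)} & \text{if } \imp(X_v,y_{1v}) = 0 \,\vee\, \imp(X_v,y_{2v}) = 0,\\
\text{joint split via \eqref{eq:impJoint}} & \text{otherwise.}
\end{cases}
\]
Only the second case requires evaluating the joint minimization, which is what allows it to be computed lazily.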

The JST construction ends if pure leaf nodes are found or the height of the JST has reached a pre-defined hyper-parameter value $k$. 

\paragraph{JST Refinement}
\label{sec:algo:refine}

We now present an iterative process for refinement aimed at increasing precision of diff regions.% \eqref{eq:diff-region}. 

For each leaf $l_i$ contributing to a diff region \eqref{eq:diff-region}, if its samples (satisfying $\pc(l_i)$) have more than one label as given by the model $M_i$ being approximated (the leaf is impure), we can further split it into two leaf nodes. This refines the decision tree surrogates only in the diff regions and not at all impure leaves. Next, diff regions are recomputed with the resulting leaf nodes. This process can continue for a pre-defined number of steps or until some budget is met. Every such iteration increases the tree depth by $1$ (but not uniformly) and improves the fidelity of the individual sub-tree rooted at an or-node. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{EXPERIMENTAL RESULTS}
\label{sec:expt}

% @Swagatam
We report experimental results comparing the proposed IMD technique to learning separate surrogates for the two models (Section~\ref{sec:expt:sep}), and to direct difference modeling and the prior work of \citet{nair2021changed} (Section~\ref{sec:expt:alt}). The effect of refinement is demonstrated in Section~\ref{sec:expt:ref}. The following paragraphs describe the setup of the experiments.


\paragraph{Datasets} We have used 13 publicly available~\citep{UCIRepo,OpenML_as_a_whole,alcala2011keel} tabular classification datasets, including both binary and multiclass classification tasks. As preprocessing steps, we dropped duplicate instances occurring in the original data, and one-hot encoded categorical features.


\paragraph{Models} We split each dataset in the standard 70/30 ratio, and trained an array of machine learning models: Decision Tree Classifier (DT), Random Forest Classifier (RF), K-Neighbors Classifier (KN), Logistic Regression (LR), Gradient Boosting (GB), Multi-Layer Perceptron (MLP), and Gaussian Naive Bayes (GNB). For some models, multiple instances were trained with different parameter values. We used the Scikit-learn~\citep{scikit-learn} implementations for training. We did not tune the performance of the trained models, and used them as black boxes (through the \texttt{predict()} method only) for subsequent analyses.
The dataset and model details including test set accuracies are reported in the supplementary material (SM). %~\ref{appdx: bench} and ~\ref{appdx: models}.



\paragraph{Set Up} We have selected two pairs of models per dataset corresponding to the largest and smallest differences in accuracy on the test set (indicated as \textit{max $M_1$-$M_2$} and \textit{min $M_1$-$M_2$} in Table~\ref{tab:ablations-with-deltas-trimmed-for-uai}). This ensures we compare models with contrasting
predictive performance, as well as models that achieve similar accuracy. For fitting and evaluating diff models, including our IMD approach as well as baselines, we split the available dataset $X$ (without labels) in a 70/30 ratio into $\Dtrain$ and $\Dtest$. This split is not and does not have to be the same as the train/test splits for training and evaluating the underlying models.
We perform 5 train/test splits and report in the main paper the mean of the following metrics across the 5 runs, with standard deviation values in the SM.



\paragraph{Metrics} To measure how accurately we capture the true regions of disagreement between models $M_1$ and $M_2$, %using the rules, 
we use the following metrics. Given a test set $\Dtest$, we have a subset of \emph{true diff samples}:
\[ \Ttrue = \{ x \in \Dtest \,|\, M_1(x) \neq M_2(x)\}, \]
and the predicted diff samples by the diff model $\hat{D}(x)$:
\[ \Tpred = \{ x \in \Dtest \,|\, \hat{D}(x) = 1
\}.
\]
Recall that in the case where we have extracted a diff ruleset $\mathcal{R}$ for $\hat{D}(x)$, $x \in \Tpred$ if there exists a rule $r \in \mathcal{R}$ that is satisfied by $x$.

\noindent{\bf Precision (Pr)} is the ratio $\frac{| \Ttrue\, \cap \, \Tpred|}{|\Tpred|}$, measuring the fraction of predicted diff samples that are true diff samples on the test set $\Dtest$.

\noindent{\bf Recall (Re)} is the ratio $\frac{|\Ttrue \, \cap \, \Tpred|}{|\Ttrue|}$, measuring the fraction of true diff samples in $\Dtest$ that are correctly predicted.

\noindent{\bf Interpretability} For interpretable diff models for which we have extracted a diff ruleset  $\mathcal{R}$, we measure its interpretability in terms of the number of rules \textbf{(\# r)} in the set, and the number of unique predicates \textbf{(\# p)} summed over all the rules in the set. The choice of the above metrics is motivated by the works of~\citet{lakkaraju2016interpretable,dash2018boolean,letham2015interpretable}.
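
To make the precision and recall definitions concrete, consider a hypothetical test set (the counts are invented for illustration) with $|\Dtest| = 1000$, $|\Ttrue| = 200$, $|\Tpred| = 150$, and $|\Ttrue \cap \Tpred| = 120$. Then
\[
    \mathrm{Pr} = \frac{120}{150} = 0.80, \qquad \mathrm{Re} = \frac{120}{200} = 0.60,
\]
that is, $80\%$ of the predicted diff samples are true diff samples, and $60\%$ of the true diff samples are recovered.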


\subsection{IMD against Separate Surrogates}
\label{sec:expt:sep}

% 
% WITHOUT PREDICATES COLUMN and ADDED DELTA FOR PR/RE/RULES
% 

\begin{table*}[t]
\centering
\caption{Separate surrogates show slightly higher recall, but IMD shows comparable performance with much lower complexity.
}

\begin{tabular}{cc}
% Table & Same Table \\

        % Table 1

        \begin{tabular}{llcccrccr}
        \toprule
                  &            &     &  \multicolumn{3}{c}{\textbf{Separate Surrogates}} & \multicolumn{3}{c}{\textbf{IMD}} \\
        \cmidrule(r){4-6} \cmidrule(l){7-9}
        \textbf{Dataset}        &  \textbf{$M_1$ vs.~$M_2$} & \textbf{diffs}   &          Pr &      Re &      \#r & Pr &      Re &    \#r \\
        \hline
        \multirow{2}{*}{adult} & max MLP1-GB & 0.20  &           0.96 &  0.88 &    70.0 &             0.96 &  0.88 &  18.0 \\
          & min MLP2-DT2 & 0.08  &           0.45 &  0.29 &   155.4 &             0.46 &  0.16 &  17.4 \\
        \cline{1-9}
        \multirow{2}{*}{bankm} & max MLP2-GB &0.26  &           0.66 &  0.75 &   263.6 &             0.70 &  0.67 &  23.0 \\
                  & min MLP1-GNB & 0.26 &           0.74 &  0.75 &   345.0 &             0.71 &  0.69 &  34.4 \\
        % \cline{1-9}
        % \multirow{2}{*}{banknote} & max KN1-GNB & 0.15  &          0.88 &  0.89 &    32.2 &             0.90 &  0.88 &  13.4 \\
                  % & min LR-DT1 &  0.03  &          0.53 &  0.60 &    30.8 &             0.64 &  0.47 &   7.2 \\
        % \cline{1-9}
        % --\multirow{2}{*}{bc} & max DT1-GNB & 0.05 &           0.38 &  0.46 &    39.2 &             0.44 &  0.40 &   9.6 \\
                  % & min KN2-RF2 & 0.07  &           0.37 &  0.38 &    49.0 &             0.30 &  0.24 &  10.8 \\
        % \cline{1-9}
        % \multirow{2}{*}{diabetes} & max MLP2-GB & 0.22  &           0.42 &  0.45 &   215.8 &             0.40 &  0.28 &  24.2 \\
                  % & min RF1-GNB & 0.18  &           0.39 &  0.43 &   156.0 &             0.31 &  0.34 &  20.8 \\
        \cline{1-9}
        \multirow{2}{*}{eye} & max RF1-GNB &  0.56  &           0.65 &  0.66 &  1054.0 &             0.60 &  0.71 &  36.2 \\
                  & min LR-MLP1 & 0.34  &           0.59 &  0.53 &   781.6 &             0.57 &  0.39 &  28.4 \\
        \cline{1-9}
        \multirow{2}{*}{heloc} & max KN1-RF2 & 0.23 &           0.40 &  0.23 &   373.0 &             0.40 &  0.13 &  15.8 \\
                  & min GB-RF1 & 0.17   &           0.30 &  0.19 &   234.4 &             0.25 &  0.06 &  14.6 \\
        \cline{1-9}
        \multirow{2}{*}{magic} & max RF1-GNB & 0.25 &           0.75 &  0.58 &   362.8 &             0.75 &  0.52 &  25.0 \\
                  & min MLP2-DT2 & 0.11 &           0.43 &  0.36 &   282.6 &             0.42 &  0.17 &  11.0 \\
        % \cline{1-9}
        % \multirow{2}{*}{mushroom} & max KN1-GNB & 0.03  &           0.94 &  0.70 &    28.0 &             0.81 &  0.70 &   5.0 \\
                  % & min RF2-GNB & 0.03  &           0.93 &  0.70 &    42.0 &             0.74 &  0.71 &   8.6 \\
        \cline{1-9}
        \multirow{2}{*}{redwine} & max RF1-KN2 &  0.37  &           0.46 &  0.52 &   627.8 &             0.52 &  0.25 &  29.0 \\
                  & min KN1-GNB & 0.52  &           0.70 &  0.59 &   563.6 &             0.69 &  0.47 &  40.4 \\
        \cline{1-9}
        \multirow{2}{*}{tictactoe} & max LR-GNB & 0.34  &          0.76 &  0.78 &   109.6 &             0.76 &  0.89 &  24.4 \\
                  & min DT2-KN2 & 0.06  &           0.10 &  0.15 &    54.0 &             0.16 &  0.11 &   5.8 \\
        \cline{1-9}
        \multirow{2}{*}{waveform} & max LR-DT1 & 0.18   &           0.45 &  0.52 &   746.0 &             0.49 &  0.27 &  33.2 \\
                  & min MLP1-RF2 & 0.11 &           0.17 &  0.32 &   725.0 &             0.10 &  0.02 &   9.0 \\
        % \cline{1-9}
        % \multirow{2}{*}{whitewine} & max RF1-GNB & 0.53 &          0.64 &  0.59 &   847.2 &             0.63 &  0.56 &  42.6 \\
                  % & min LR-KN2 &  0.48  &           0.56 &  0.33 &   580.0 &             0.55 &  0.35 &  36.6 \\
        \bottomrule
        \end{tabular} &
        
        % Table 2
        
        \begin{tabular}{rrr}
        \toprule
        \multicolumn{3}{c}{\textbf{IMD $-$ Sep.}}   \\
        \cmidrule{1-3}
        $\Delta{\textrm{Pr}}$ &     $\Delta{\textrm{Re}}$ & $\Delta{\textrm{\#r}}$ \\
        
        \hline
        
        
            $-$0.00 & $-$0.00 &   $-$52.0 \\
             $+$0.01 & $-$0.13 &  $-$138.0 \\
             \cline{1-3}
             
             $+$0.04 & $-$0.08 &  $-$240.6 \\
            $-$0.03 & $-$0.06 &  $-$310.6 \\
            \cline{1-3}
            
            %  $+$0.01 & $-$0.01 &   $-$18.8 \\
            %  $+$0.11 & $-$0.13 &   $-$23.6 \\
            %  \cline{1-3}
        
            %  $+$0.06 & $-$0.05 &   $-$29.6 \\
            % $-$0.07 & $-$0.14 &   $-$38.2 \\
            % \cline{1-3}
        
            % $-$0.01 & $-$0.17 &  $-$191.6 \\
            % $-$0.08 & $-$0.09 &  $-$135.2 \\
            % \cline{1-3}
        
            $-$0.06 & $+$0.05 & $-$1017.8 \\
            $-$0.02 & $-$0.14 &  $-$753.2 \\
            \cline{1-3}
        
             $+$0.00 & $-$0.10 &  $-$357.2 \\
            $-$0.05 & $-$0.13 &  $-$219.8 \\
            \cline{1-3}
        
             $+$0.00 & $-$0.06 &  $-$337.8 \\
            $-$0.01 & $-$0.18 &  $-$271.6 \\
            \cline{1-3}
        
            % $-$0.13 & $-$0.01 &   $-$23.0 \\
            % $-$0.18 & $+$0.00 &   $-$33.4 \\
            % \cline{1-3}
        
             $+$0.06 & $-$0.27 &  $-$598.8 \\
            $-$0.01 & $-$0.11 &  $-$523.2 \\
            \cline{1-3}
        
            $-$0.00 & $+$0.11 &   $-$85.2 \\
             $+$0.05 & $-$0.04 &   $-$48.2 \\
             \cline{1-3}
        
             $+$0.04 & $-$0.25 &  $-$712.8 \\
            $-$0.07 & $-$0.30 &  $-$716.0 \\
            % \cline{1-3}
        
            % $-$0.01 & $-$0.03 &  $-$804.6 \\
            % $-$0.01 & $+$0.03 &  $-$543.4 \\
        
        
        
        \bottomrule
        \end{tabular} \\
\end{tabular}

\label{tab:ablations-with-deltas-trimmed-for-uai}
\end{table*}


First we study the effect of jointly training surrogates in IMD, which encourages sharing of split nodes, against training separate surrogates for the two models. Since these are both surrogate-based approaches to obtain a diff model $\hat{D}$, we compare the metrics for the \textit{diff rulesets} extracted (as described in Section~\ref{sec:prob}) from the surrogates. IMD extracts diff rulesets from JSTs, while the separate surrogate approach is a special case of IMD corresponding to $\alpha=1$. The height (a.k.a.~maximum depth) of the surrogates is restricted to 6 for both of the approaches. We do not perform the refinement step here as we study it in Section~\ref{sec:expt:ref}.

\noindent{\bf Observations } The metrics are reported for 8 datasets in Table~\ref{tab:ablations-with-deltas-trimmed-for-uai} (full version in Appendix). The differences in Pr, Re, and \# rules are also tabulated for better readability. We also report the fraction of diff samples in $\Dtest$ for each dataset and model pair combination in the ``diffs'' column. This value is also the precision of a trivial diff-model ($\hat{D}(x)=1\,\, \forall x$, recall$=1.0$), or of any diff-model that predicts \emph{diff} with probability $q$ (recall$=q$), e.g., $q=0.5$ is a random guesser. For most benchmarks, diff prediction quality for both approaches is clearly better than this trivial baseline.
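
The baseline precision stated above can be verified directly. For the trivial diff-model, $\Tpred = \Dtest$, so
\[
    \mathrm{Pr} = \frac{|\Ttrue \cap \Dtest|}{|\Dtest|} = \frac{|\Ttrue|}{|\Dtest|}, \qquad \mathrm{Re} = \frac{|\Ttrue|}{|\Ttrue|} = 1.
\]
For a model that predicts \emph{diff} independently with probability $q$, we have in expectation $|\Ttrue \cap \Tpred| = q\,|\Ttrue|$ and $|\Tpred| = q\,|\Dtest|$. Approximating the expected ratio by the ratio of expectations, the precision is again $|\Ttrue| / |\Dtest|$, and the expected recall is $q$.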


To summarize the table, below we compare the approaches on the basis of average percentage increase or decrease in precision and recall (on going from separate to IMD) across all datasets. We also perform Wilcoxon's signed rank test (as recommended by \citet{benavoli2016should}) to verify the statistical significance of the observed differences.

For precision, we observe a very small drop ($1.55$\% on average) going from separate surrogates to IMD. The $p$-value from Wilcoxon's test is $0.269$, implying no significant difference (at level $0.05$) between the approaches. For recall, we observe that IMD is $23.45$\% lower on average. Wilcoxon's test confirms this difference with a $p$-value of $0.0002$, and a sign test also shows that separate surrogates have higher recall for 22 of the 26 benchmarks.
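
One plausible reading of the average percentage change (an assumption, since the exact formula is not spelled out here) is the mean of the per-benchmark relative changes. Writing $m_i^{\text{Sep}}$ and $m_i^{\text{IMD}}$ for the value of a metric on benchmark $i$ and $n$ for the number of benchmarks (notation introduced only for this illustration), this would be
\[
    \Delta_{\%} = \frac{100}{n} \sum_{i=1}^{n} \frac{m_i^{\text{IMD}} - m_i^{\text{Sep}}}{m_i^{\text{Sep}}}.
\]
Under this reading, the precision on the second adult row of Table~\ref{tab:ablations-with-deltas-trimmed-for-uai} would contribute $(0.46 - 0.45)/0.45 \approx +2.2\%$ to the average.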

For the interpretability metrics, however, IMD is the clear winner, as seen in the columns corresponding to the numbers of rules (\# r) and unique predicates (\# p, in Appendix). If we simply average the numbers of rules and predicates to understand the scale of the difference (with the caveat that different datasets and model pairs have different complexities), the average numbers of rules for separate surrogates and IMD are 337.25 and 20.94, and the average numbers of predicates are 135.41 and 56.10. The corresponding $p$-values are also very low (on the order of $10^{-6}$).

%\noindent{\bf Conclusion } The difference model $\hat{D}$ obtained from separately trained surrogates is somewhat more accurate as evidenced by higher recall, %, and thus F1-score), but it comes with a huge expense in interpretability. The results also affirm the effect of sharing nodes (promoting similarity between surrogates) as much as possible in JST, which localizes differences before divergence.







%\subsection{Comparison among alternate approaches}
\subsection{Comparison with Other Approaches}
\label{sec:expt:alt}

\begin{table}[t]
% \scriptsize
% \small
\centering
\caption{Comparison of F1-scores. The mean ranks (lower is better) show that separate surrogates and Direct GB are the most accurate, but IMD is close while being more interpretable.}
\begin{tabular}{lccccc}
\toprule
% \hline
          &   & \textbf{Sep.} & \textbf{Direct} & \textbf{Direct} & \textbf{BRCG} \\
          
\textbf{Dataset} & \textbf{IMD} &  \textbf{Surr.} &      \textbf{DT} &        \textbf{GB} &             \textbf{Diff.} \\

% \midrule
\hline
\multirow{2}{*}{adult} &             0.92 &           0.92 &     0.92 &     0.98 &          0.33 \\
          &              0.23 &           0.34 &     0.17 &     0.61 &          0.31 \\
\cline{1-6}
\multirow{2}{*}{bankm} &             0.68 &           0.70 &     0.69 &     0.77 &          0.41 \\
          &              0.70 &           0.75 &     0.68 &     0.82 &          0.41 \\
\cline{1-6}
\multirow{2}{*}{banknote} &             0.89 &           0.89 &     0.83 &     0.94 &          0.27 \\
          &             0.52 &           0.56 &     0.57 &     0.63 &          0.06 \\
\cline{1-6}
\multirow{2}{*}{bc} &             0.39 &           0.41 &     0.17 &     0.00 &          0.10 \\
          &             0.25 &           0.37 &     0.28 &     0.19 &          0.13 \\
\cline{1-6}
\multirow{2}{*}{diabetes} &             0.32 &           0.43 &     0.21 &     0.35 &          0.35 \\
          &             0.32 &           0.41 &     0.09 &     0.22 &          0.30 \\
\cline{1-6}
\multirow{2}{*}{heloc} &             0.19 &           0.29 &     0.03 &     0.14 &          0.37 \\
          &             0.10 &           0.22 &     0.02 &     0.05 &          0.27 \\
\cline{1-6}
\multirow{2}{*}{magic} &             0.62 &           0.65 &     0.63 &     0.78 &          0.40 \\
          &             0.24 &           0.39 &     0.14 &     0.27 &          0.20 \\
\cline{1-6}
\multirow{2}{*}{mushroom} &             0.75 &           0.80 &     0.81 &     0.97 &          0.76 \\
          &             0.72 &           0.80 &     0.81 &     0.97 &          0.74 \\
\cline{1-6}
\multirow{2}{*}{tictactoe} &             0.82 &           0.77 &     0.77 &     0.82 &          0.83 \\
          &             0.12 &           0.12 &     0.00 &     0.09 &          0.00 \\
% \bottomrule
\midrule
\textit{mean rank} &   \textbf{3.278} & \textbf{2.056} &  \textbf{3.694} & \textbf{2.278} & \textbf{3.694} \\

\bottomrule
\end{tabular}

\label{tab:approaches}
\end{table}

In this experiment, we compare the quality of prediction of the true dissimilarity $D$ %the binary \emph{diff-model}
with respect to other baselines. 
The first two baselines are direct approaches (introduced in Section~\ref{sec:prob}) as they relabel the instances as \emph{diff} (``1'') or \emph{non-diff} (``0'') and directly fit a classification model on the relabeled instances.
Out of a huge number of possible models for this binary classification problem of predicting \textit{diff} or \textit{non-diff}, we choose Decision Tree (with \texttt{max\_depth=6}) to be directly comparable to JST, and Gradient Boosting Classifier (\texttt{max\_depth=6}, rest default settings in Scikit-learn) to provide a more expressive but uninterpretable benchmark.
% We have chosen two direct diff models: Decision Tree (maximum depth restricted to 6) and Gradient Boosting Classifier (maximum depth of 6, other parameters at the default configurations in Scikit-learn) --- which is more expressive, but uninterpretable.
These choices are made to compare the quality of surrogate-based diff regions against directly modeled diff regions,
and to understand whether we significantly compromise on quality by not using a more expressive but uninterpretable model.
As a third baseline, we compare to diff rulesets obtained from Grounded BRCG~\citep{nair2021changed} ruleset surrogates for the two models. 
The surrogate-based approaches from the previous subsection, IMD (without refinement) and separate, are also included for completeness.

\noindent{\bf Observations } We have listed the F1-scores (harmonic mean of precision and recall) in Table~\ref{tab:approaches}, and omitted the $M_1$ vs.~$M_2$ column (same as in Table~\ref{tab:ablations-with-deltas-trimmed-for-uai}) for brevity. Since \textit{BRCG Diff.} applies only to binary classification tasks, we only show it for those.
Note that for IMD and separate surrogates, the precision and recall values are already reported in Table~\ref{tab:ablations-with-deltas-trimmed-for-uai}. For the other methods and datasets, precision, recall, and \# rules (if applicable) are in the Appendix. On average, we observe that IMD achieves an $89.76$\% improvement in F1-score over Direct DT, and a $98.52$\% improvement over the BRCG Diff.~approach.\footnote{This is computed by removing the second subrow for tictactoe, as the F1-score is 0 for both Direct DT and BRCG Diff.~and the jump is infinite. This removal is thus favorable toward them.} On the other hand, we do not observe a large drop in F1-score from the uninterpretable Direct GB to IMD ($-5.87\%$). %\footnote{This %average change 
% is computed by removing the first row for the \textit{bc} dataset as the F1-score for Direct GB is very low ($0.03$) and the relative increase (to $0.34$) is high. It is thus favorable toward Direct GB.} 
Similarly, the precision and recall differences in Section~\ref{sec:expt:sep} combine to give a $15.26$\% decrease in F1-score in going from separate surrogates to IMD. 
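
As a sanity check on how the F1-scores relate to the precision and recall values in Table~\ref{tab:ablations-with-deltas-trimmed-for-uai}, recall that
\[
    \mathrm{F1} = \frac{2\,\mathrm{Pr}\,\mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}.
\]
Plugging in the rounded mean values for IMD on the first adult row ($\mathrm{Pr} = 0.96$, $\mathrm{Re} = 0.88$) gives $2 \cdot 0.96 \cdot 0.88 / 1.84 \approx 0.92$, matching the corresponding entry in Table~\ref{tab:approaches}. Small discrepancies can arise on other rows, presumably because the tabulated values are rounded means over the 5 runs, and the F1-score of mean values need not equal the mean F1-score.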

We report mean ranks in Table~\ref{tab:approaches} and perform Friedman's test following \citet{demsar2006statistical}, which confirms significant differences between the methods with a $p$-value of approximately $0.0006$. Next we perform pairwise comparisons of IMD against the other approaches. 
The $p$-values from Wilcoxon's signed rank test are $0.00025$, $0.043$, $0.043$, and $0.1594$ %for comparisons
against separate, \textit{BRCG Diff.}, \textit{Direct DT}, and \textit{Direct GB} respectively. We pit these against the Holm-corrected thresholds of $0.0125$, $0.017$, $0.025$, $0.05$, and observe that %all of the differences are significant. We emphasize that although separate and direct GB have consistently higher F1-scores than IMD, the size of the differences is small and IMD is considerably more interpretable.
only the first one (IMD vs.~separate surrogates) is significant. However, we emphasize that although separate surrogates and Direct GB tend to have higher F1-scores than IMD, the size of the differences is small and IMD is considerably more interpretable.
% \textcolor{blue}{
For the interpretable methods, the average numbers of rules observed for IMD, Direct DT, and BRCG Diff.~are $16.05$, $10.50$, $37.69$ (separate surrogates was already discussed in Section~\ref{sec:expt:sep}).
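
The Holm-corrected thresholds quoted above follow from the step-down procedure with significance level $0.05$ and $m = 4$ comparisons: the $i$-th smallest $p$-value is compared against $0.05/(m - i + 1)$, giving
\[
    \tfrac{0.05}{4} = 0.0125, \quad \tfrac{0.05}{3} \approx 0.017, \quad \tfrac{0.05}{2} = 0.025, \quad \tfrac{0.05}{1} = 0.05.
\]
The smallest $p$-value, $0.00025$, is below $0.0125$, so the corresponding null hypothesis is rejected. The next one, $0.043$, exceeds $0.017$, at which point the procedure stops and the remaining hypotheses are retained.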
% }

% \textcolor{blue}{
We present further experiments (in Appendix) that vary the depth to study the accuracy-complexity trade-off for Direct DT, separate surrogates, and IMD. While the trade-offs for Direct DT and IMD are competitive, both are consistently better than that of separate surrogates. We also discuss a qualitative comparison between Direct DT and IMD, which brings out the benefit of IMD in placing the diff rules in the context of the models' decision logic, as already seen in Figure~\ref{fig:jst}.
% }

% \textcolor{red}{Further experiments (in Appendix) on varying depth also corroborate that Direct DT achieves lower F1 scores than IMD on most benchmarks.}

% \textcolor{red}{For the interpretable methods, the average numbers of rules observed for IMD, Direct DT, and BRCG Diff.~are $16.05$, $10.50$, $37.69$ (separate surrogates was already discussed in Section~\ref{sec:expt:sep}). While these numbers are all comparable, the benefit of IMD is that it places the diff rules in the context of the models' decision logic, as seen in Figure~\ref{fig:jst}.}



% We have also seen that BRCG Diff. approach typically achieved higher Recall, but very low Precision scores (not shown in Table~\ref{tab:approaches}) that ultimately brings down its F1.









\subsection{Effect of Refinement}
\label{sec:expt:ref}

\begin{table}
% \small
\centering
\caption{Precision improves on refinement ($\text{IMD}_\text{6+1}$).}

\begin{tabular}{lccc}
\toprule
% \hline
    \textbf{Dataset}     & $\text{\bf IMD}_\text{\bf6}$ & $\text{\bf IMD}_\text{\bf6+1}$ & $\text{\bf IMD}_\text{\bf7}$ \\

\midrule

\multirow{2}{*}{adult} &             0.96 &                   \textbf{0.96} &             0.95 \\
          &              0.46 &                   \textbf{0.59} &             0.53 \\
\cline{1-4}


\multirow{2}{*}{bankm} &            0.70 &                   \textbf{0.78} &             0.77 \\
          &             0.71 &                   \textbf{0.79} &             0.74 \\
\cline{1-4}


% \multirow{2}{*}{banknote}  &             0.90 &                   \textbf{0.90} &             0.89 \\
%           &             0.64 &                   0.67 &             \textbf{0.76} \\
% \cline{1-4}


% \multirow{2}{*}{bc}  &             0.44 &       0.44 &             0.44 \\
%           &             \textbf{0.30} &      0.28 &             0.26 \\
% \cline{1-4}


% \multirow{2}{*}{diabetes} &             0.40 &  \textbf{0.47} &             0.44 \\
%           &             0.31 &                   0.33 &             0.34 \\
% \cline{1-4}



\multirow{2}{*}{eye}  &             0.60 & \textbf{0.67} &             0.62 \\
          &             0.57 &                   \textbf{0.64} &             0.57 \\
\cline{1-4}


\multirow{2}{*}{heloc} &             0.40 &     \textbf{0.45} &             0.42 \\
          &             0.25 &                   0.25 &             \textbf{0.26} \\
\cline{1-4}



\multirow{2}{*}{magic} &             0.75 &     \textbf{0.80} &    0.73 \\
          &             0.42 &      \textbf{0.55} & 0.46 \\
\cline{1-4}



% \multirow{2}{*}{mushroom} & 0.81 &      \textbf{0.95} &             0.88 \\
          % &             0.74 &  \textbf{0.94} & 0.89 \\
% \cline{1-4}


\multirow{2}{*}{redwine} &  0.52 &  \textbf{0.56} &             0.48 \\
          &             0.69 &      \textbf{0.73} &             0.68 \\
\cline{1-4}


\multirow{2}{*}{tictactoe} &    0.76 &      \textbf{0.79} &             0.78 \\
          &             0.16 &  \textbf{0.19} & 0.18 \\
\cline{1-4}


\multirow{2}{*}{waveform} &     0.49 &  \textbf{0.54} &             0.49 \\
          &             0.10 &      0.14 &     \textbf{0.17} \\
% \cline{1-4}



% \multirow{2}{*}{whitewine} &    0.63 &      \textbf{0.67} &  0.64 \\
%           &             0.55 &      \textbf{0.59} &             0.59 \\
\bottomrule
\end{tabular}

% \caption{Improvement of precision on refinement ($\text{IMD}_\text{6+1}$).}
\label{tab:refinement}
\end{table}
To investigate the effect of the refinement step of IMD (described at the end of Section~\ref{sec:algo:refine}), we compare diff rulesets obtained from three variations of the algorithm: IMD with a maximum depth of 6 ($\text{IMD}_6$), as in the previous experiments; $\text{IMD}_6$ with one iteration of refinement ($\text{IMD}_{6+1}$); and IMD with a maximum depth of 7 ($\text{IMD}_7$).

Looking at Table~\ref{tab:refinement} (not all benchmarks are shown for lack of space), we observe an improvement in precision from $\text{IMD}_6$ to $\text{IMD}_{6+1}$ ($11.27$\% on average) and, interestingly, also from $\text{IMD}_7$ to $\text{IMD}_{6+1}$ ($4.22$\% on average). The $p$-values from Wilcoxon's test are on the order of $10^{-3}$ for both comparisons, validating the significance of the improvement. The average numbers of rules for the three approaches are 20.93, 28.77, and 41.01, respectively, confirming that $\text{IMD}_{6+1}$ refines only selectively compared to $\text{IMD}_{7}$.

The results demonstrate that selective splitting of impure leaf nodes only in predicted diff regions ($\text{IMD}_{6+1}$), %where the surrogates predict different labels, 
improves precision %(of diff rules) 
compared to regular tree splitting of \textit{all} impure nodes ($\text{IMD}_{7}$). However, this improvement should be interpreted with some caution, as it comes at the cost of a consistent drop in recall ($15.37$\% from $\text{IMD}_6$ and $25.14$\% from $\text{IMD}_7$, averaged across all benchmarks). We therefore recommend refinement specifically for scenarios requiring high-precision difference modeling.


\paragraph{Experimental Conclusions} IMD has close to the same F1-scores as the top methods in our comparison, separate surrogates and the (uninterpretable) Direct GB. At the same time, IMD yields much more concise results, with an order of magnitude fewer diff rules than separate surrogates. This affirms the benefit of sharing nodes in JST, which localizes differences before divergence.
% \textcolor{blue}{
We also see (in SM) that the features deemed important by the JST are close to those that the models themselves rely on in their decision logic, as measured by feature importance computations. This supports our claim that JSTs achieve two things at once: they provide interpretable surrogates, and these surrogates can be compared easily for the two models.
% }
% \textcolor{red}{IMD has better F1-scores than Direct DT and BRCG Diff.~while having a comparable number of diff rules and a more unified JST representation.} 
Refinement further improves the precision of IMD, but at the cost of recall and interpretability. Additional experiments (in SM) %that study the effect of depth on the metrics among the baselines 
also support these conclusions.



\subsection{Case Study}
\label{sec:expt:casestudy}

We conclude by demonstrating a practical application of the method to fairness in the advertising domain. Bias in ad campaigns leads to poor outcomes for companies that fail to reach the right audience, and for customers who are incorrectly targeted. Bias mitigation aims to correct this by changing models to have more equitable outcomes.

Our IMD method can be used to assess the impact of bias mitigation on a model. In this case study, a bias mitigation method was applied to the %privileged 
group of \emph{non}-homeowners who had higher predicted rates of conversion (relative to ground truth). The root node of the JST captures this group. Figure~\ref{fig:ad:subtree} shows a part of the JST (full tree in the Appendix). Although the non-homeowner group is already over-predicted, the JST shows that for certain cohorts within the
group (those outside the ages of 25--34), conversions are predicted where the model before mitigation would not have predicted them. Interpretable model differencing thus captures unintended consequences of model alterations. 


\begin{figure}[ht]
    \centering
    \includegraphics[width=0.5\columnwidth]{images/sub_graph_priv.png}
    \caption{A subtree of the JST showing an unintended increase in predicted conversions after bias mitigation for a cohort of the already over-predicted group of non-homeowners.}
    \label{fig:ad:subtree}
\end{figure}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{CONCLUSION}
\label{sec:concl}

We addressed the problem of interpretable model differencing, localizing and representing differences between ML models for the same task. We proposed JST to provide a unified view of the similarities and dissimilarities between the models as well as a succinct ruleset representation.
Experimental results indicate that the proposed IMD approach yields a favorable trade-off between accuracy and interpretability in predicting model differences.%, close to the accuracy of much more complex diff models while significantly more accurate than interpretable baselines.

The current work is limited to comparing classifiers in terms of $0$-$1$ dissimilarity. Since IMD is based on decision trees, its interpretability is limited to domains where the features are interpretable. While we have chosen to extend greedy decision tree algorithms due to ease and scalability, the resulting JSTs accordingly have no guarantees of optimality.

Future work could seek to address the above limitations. To extend the framework to regression tasks, a potential avenue is to threshold the difference function $D(M_1(x), M_2(x))$ and apply the classification framework presented herein. The problem of interpretable model differencing for images and language remains open. The constituent features for these modalities are generally not interpretable, making the diff rulesets uninterpretable without additional considerations.
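
Returning to the regression extension, one way to make the thresholding idea concrete, given only as a sketch with a hypothetical threshold $\tau > 0$ that would need to be chosen per application, is to define the binary dissimilarity
\[
    D_\tau(x) =
    \begin{cases}
    1 & \text{if } D(M_1(x), M_2(x)) > \tau,\\
    0 & \text{otherwise,}
    \end{cases}
\]
with, for instance, $D(M_1(x), M_2(x)) = |M_1(x) - M_2(x)|$ for scalar outputs. How best to build surrogates for the individual regression models in this setting is left open here.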
  
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Acknowledgements}
This work was partially funded by the European Union’s Horizon Europe research and innovation programme under grant agreement no. 101070568.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% References
% \clearpage % start references from a new page
\balance % balance the 2 column references last page
\bibliography{haldar_679}

% ----- supplementary -----
% \appendix
% \onecolumn
% \input{aaai_appendix}

\end{document}