% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\newif\ifsupp
\supptrue

\ifsupp
\title{Improved Feature Importance Computation for Tree Models\\Based on the Banzhaf Value (Supplementary material)}
\else
\title{Improved Feature Importance Computation for Tree Models\\Based on the Banzhaf Value}
\fi

\author[1,2]{\href{mailto:a.karczmarz@mimuw.edu.pl}{Adam Karczmarz}{}}
\author[1,2]{\href{mailto:tpm@mimuw.edu.pl}{Tomasz Michalak}{}}
\author[1,2]{\href{mailto:anish@mimuw.edu.pl}{Anish Mukherjee}{}}
\author[1,2,3]{\href{mailto:sank@mimuw.edu.pl}{Piotr Sankowski}{}}
\author[1,3]{\href{mailto:wygos@mimuw.edu.pl}{Piotr Wygocki}{}}

\affil[1]{%
    Institute of Informatics\\
    University of Warsaw\\
    Poland
}
\affil[2]{%
    IDEAS NCBR\\
    Warsaw, Poland
}
\affil[3]{%
    MIM Solutions\\
    Warsaw, Poland
  }


\usepackage{amsthm}
\usepackage{thm-restate}
\usepackage{amssymb}
\usepackage{algorithm}
\usepackage{algcompatible}
\usepackage[noend]{algpseudocode}
\usepackage{subcaption}


\newtheorem{lemma}{Lemma}
\newtheorem{observation}{Observation}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}

\newcommand{\bos}{\texttt{BOSTON}}
\newcommand{\nh}{\texttt{NHANES}}
\newcommand{\fl}{\texttt{FLIGHTS}}
\newcommand{\hi}{\texttt{VEHICLE\_INSURANCE}}
\newcommand{\hishort}{\texttt{V.INS.}}
\newcommand{\synd}{\texttt{SYNTHETIC\_DENSE}}
\newcommand{\syns}{\texttt{SYNTHETIC\_SPARSE}}
\newcommand{\bosgb}{\texttt{BOSTON\_GB}}
\newcommand{\nhgb}{\texttt{NHANES\_GB}}
\newcommand{\flgb}{\texttt{FLIGHTS\_GB}}
\newcommand{\higb}{\texttt{VEHICLE\_INSURANCE\_GB}}
\newcommand{\bosdt}{\texttt{BOSTON\_DT}}
\newcommand{\nhdt}{\texttt{NHANES\_DT}}
\newcommand{\fldt}{\texttt{FLIGHTS\_DT}}
\newcommand{\hidt}{\texttt{VEHICLE\_INSURANCE\_DT}}
\newcommand{\shapours}{\texttt{shap\_orig\_a}}
\newcommand{\shapfast}{\texttt{shap\_fast}}
\newcommand{\shaporig}{\texttt{shap\_orig}}
\newcommand{\ban}{\texttt{BANZHAF}}


\newcommand{\tr}{\mathcal{T}}
\newcommand{\lvs}{\mathcal{L}}

\newcommand{\EX}{\mathbb{E}}


\begin{document}
\maketitle
\begin{abstract}

The Shapley value -- a fundamental game-theoretic solution concept -- has recently become one of the main tools used to explain predictions of tree ensemble models. Another well-known game-theoretic solution concept is the Banzhaf value. Although the Banzhaf value is closely related to the Shapley value, its properties w.r.t. feature attribution have not been understood equally well. This paper shows that, for tree ensemble models, the Banzhaf value offers some crucial advantages over the Shapley value while providing similar feature attributions.

In particular, we first give an optimal $O(TL + n)$ time algorithm for computing the Banzhaf value-based attribution of a tree ensemble model's output. Here, $T$ is the number of trees, $L$ is the maximum number of leaves in a tree, and $n$ is the number of features. In comparison, the state-of-the-art Shapley value-based algorithm runs in $O(TLD^2 + n)$ time, where $D$ denotes the maximum depth of a tree in the ensemble.
%
Next, we experimentally compare the Banzhaf and Shapley values for tree ensemble models.
Both methods deliver essentially the same average importance scores for the studied datasets using two different tree ensemble models (the sklearn implementation of Decision Trees and the xgboost implementation of Gradient Boosting Decision Trees). However, our results indicate that, on top of being computable faster, the Banzhaf value is more numerically robust than the Shapley value.

\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Tree ensembles are among the most commonly used models for solving practical problems~\citep{friedman2001,kaggle_2017}. They are robust, easy to tune, and fast to train. They require modest computational resources and support different types of data as well as missing values. Given this, tree ensembles are often the first-choice model for tabular data.

One of the key research challenges regarding tree ensemble models (see Section~\ref{s:prelims} for a formal definition), and machine learning techniques in general, is explainability. When high-value decisions are made, e.g., in medical diagnostics, understanding why a model made a specific prediction is often more important than the
prediction's accuracy.
Thus, we need methods that interpret the model's results in a transparent way, so that humans are willing to follow the model's recommendations.
And indeed, a large body of previous work has been devoted to explaining tree models and their predictions, e.g.,~\citep{ChenG16, Breiman, Breiman2004, brophy2020trex, KuralenokEL19, Lundberg2020, Saabas}.

Feature attribution is one of the approaches to interpreting model predictions
that has recently attracted growing interest.
In this approach, the impact (or importance) of each feature on the model's output $f(x)$ is quantified using a numerical
value, called the feature's \emph{local attribution} (e.g., \citep{Lundberg2017, SundararajanTY17}). Similarly, one can attempt to
quantify the individual features' overall impact on the model using \emph{global} attributions (e.g.,~\citep{CovertLL20, Lundberg2020}).

One of the  most popular approaches
to feature attribution uses methods originating from cooperative game theory that are called solution concepts or \textit{values}. They measure the importance of each player in, or contribution to, a coalitional game. While there exist many ways in which the importance of each player can be evaluated, some solution concepts are considered more fundamental than others due to underlying axiom systems that uniquely determine them. 
 One important game-theoretic solution concept that has attracted a lot of attention in the context of explainability is \emph{the Shapley value}~(e.g.,~\citep{Lundberg2020,Lundberg2017,StrumbeljK14,SundararajanTY17}). To formally introduce this concept, let us denote by $\langle g,U\rangle$ a coalitional game, where $g:2^U\to\mathbb{R}$, $g(\emptyset)=0$,
 is the \textit{set function} that assigns a utility to each coalition, and $U = \{1,\ldots,n\}$ is the set of players (or -- in our context -- features). Then, the Shapley value of the feature $i\in U$ is defined as follows~\citep{Shapley53}:
 \begin{equation}\label{eq:shap}
  \phi_i=\frac{1}{n}\sum_{S\subseteq U\setminus\{i\}} {\binom{n-1}{|S|}}^{-1}\left(g(S\cup\{i\})-g(S)\right).
\end{equation}
%
To operationalize this formula in our context, we further need to define a function $g$ that extends the model $f$ to all subsets of features $S\subseteq U$, i.e.,
$g$ allows us to drop the features $U\setminus S$ from both the input $x$ and the model $f$.
Multiple ways of doing this have been proposed in the literature~\citep{SundararajanN20,JanzingMB20}.
In this paper, we focus on a popular approach by \citet{Lundberg2017} (see Section~\ref{s:prelims} for more details).
 Furthermore, it should be noted that, while the Shapley value has certain attractive properties, it is evident from the above formula that, in the general case, it requires an input of exponential size (i.e., the function $g$). However,
in certain structured environments, when $g$ is of a convenient form or is limited in size, the Shapley value can be computed in time polynomial in the number of players (features)~\citep{DengP94,GrecoLS15,michalak2013efficient,maafa2018algorithms},
e.g., for tree ensemble models~\citep{Lundberg2020}.

The Shapley value is not the only solution concept that has been advocated for interpreting model predictions. The \emph{Banzhaf value}~\citep{Banzhaf65} is the most well-studied alternative to the Shapley value in coalitional game theory, and some papers indeed suggest using it for interpreting model predictions~\citep{datta2,patel2020high,Sliwinski_Strobel_Zick_2019}.
This value, also well-known and axiomatized,
aggregates contributions of individual features differently:
\begin{equation}\label{eq:ban}
  \beta_i=\frac{1}{2^{n-1}}\sum_{S\subseteq U\setminus\{i\}}\left(g(S\cup \{i\})-g(S)\right).
\end{equation}
Mathematically, while the Shapley value is the weighted average of marginal contributions of players to coalitions, the Banzhaf value is a simple average.
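
As a small illustration (a toy case with a hypothetical shorthand $m$, not part of the formal development), let $n=3$, $i=1$, and write $m(S):=g(S\cup\{1\})-g(S)$ for the marginal contribution of feature $1$ to $S$. Then Equations~\eqref{eq:shap} and~\eqref{eq:ban} read
\begin{align*}
\phi_1&=\tfrac{1}{3}m(\emptyset)+\tfrac{1}{6}m(\{2\})+\tfrac{1}{6}m(\{3\})+\tfrac{1}{3}m(\{2,3\}),\\
\beta_1&=\tfrac{1}{4}\bigl(m(\emptyset)+m(\{2\})+m(\{3\})+m(\{2,3\})\bigr),
\end{align*}
so the two values differ only in how strongly coalitions of different sizes are weighted. For $n\leq 2$ the weights, and hence the two values, coincide.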

Unfortunately, the difference between these two values when applied
to feature attribution is not well understood in the literature.
We note that attributions based on the Shapley value have been extensively studied experimentally~\citep{Lundberg2020,Lundberg2017,StrumbeljK14,SundararajanTY17}, whereas for the Banzhaf value, such studies have been done only on some basic
datasets~\citep{datta2,Sliwinski_Strobel_Zick_2019,patel2020high}. Moreover, despite the close similarity of the two methods, to the best of our knowledge, no comparison between them has been done on real-world datasets: \citet{patel2020high} compare them on a single depth-3 tree, whereas~\citet{Patel2021GametheoreticVS} use both methods for vocabulary selection in different NLP tasks without directly comparing them. For completeness, we review other explanation methods in
\ifsupp
Appendix~\ref{section:other}.
\else
the supplementary material.
\fi

The primary theoretical property that distinguishes the Shapley value from the Banzhaf value
is so-called \emph{Efficiency}: the individual importances
$\phi_i$ sum up to precisely $g(U)$.\footnote{The Shapley and Banzhaf values satisfy similar sets of axioms, except that, for the Banzhaf value, the Efficiency axiom is replaced with the so-called \emph{2-Efficiency} axiom.}
Several authors (e.g.,~\citep{AasJL21, SundararajanN20}) find a similar property desirable for attribution methods: that the attributions sum up precisely to the difference between the output
of the model and the baseline/mean prediction of the model. However, this does not always seem
crucial, e.g., if we only want to compare the impacts of individual features,
and is not guaranteed by other attribution methods used in practice, e.g., LIME~\citep{Ribeiro2016WhySI}. Furthermore, it is also possible to consider the normalized Banzhaf value, which satisfies Efficiency~\citep{van1998axiomatizations}.
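
To see the difference on a toy example (chosen here purely for illustration), consider the three-player majority game with $g(S)=1$ if $|S|\geq 2$ and $g(S)=0$ otherwise. A player $i$ changes the outcome only when joining one of the two one-element coalitions, so Equations~\eqref{eq:shap} and~\eqref{eq:ban} give
\begin{equation*}
\phi_i=\tfrac{1}{6}+\tfrac{1}{6}=\tfrac{1}{3},\qquad \beta_i=\tfrac{1}{4}\cdot 2=\tfrac{1}{2}.
\end{equation*}
Hence $\sum_{i}\phi_i=1=g(U)$, whereas $\sum_{i}\beta_i=\tfrac{3}{2}\neq g(U)$. Both values nevertheless treat the three symmetric players identically.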

\paragraph{Our contribution.}
In this paper, we partially fill this gap by providing a comprehensive analysis of the Banzhaf value, including its comparison to the Shapley value, when applied to explainability of tree ensemble models. In particular, our contributions can be summarized as follows.

    We first show that, for tree ensemble models, when using the same natural set function $g$ as in~\citep{Lundberg2017, lundberg2018consistent, Lundberg2020}, the Banzhaf value can be computed in linear time, noticeably faster than the Shapley value.
    Specifically, we
    develop an ${O(TL + n)}$ time algorithm for computing the Banzhaf value-based attribution of a tree ensemble model's output. Here, $T$ is the number of trees, $L$ is the maximum number of leaves in a tree, and $n$ is the number of features. 
In comparison, the state-of-the-art Shapley value-based algorithm by \citet{lundberg2018consistent, Lundberg2020} runs in $O(TLD^2 + n)$ time, where $D$ denotes the maximum depth of a tree in the ensemble. We note that recent
papers~\citep{Arenas_Barcelo_Bertossi_Monet_2021, Van_den_Broeck_Lykov_Schleich_Suciu_2021} do not improve this complexity\footnote{In fact, these papers only focus on proving polynomial time complexity, and neither bound nor optimize the degrees of the actual polynomials involved. Obtaining low-degree polynomial time algorithms is crucial from the practical point of view.}, but extend the method to more complex models instead.\footnote{Though, not always without loss of generality with respect to~\citep{lundberg2018consistent, Lundberg2020}. For example, decision trees captured by the class of boolean circuits studied in~\citep{Arenas_Barcelo_Bertossi_Monet_2021} seem to forbid
using a single feature for splitting multiple times  on a root-leaf path of a decision tree.}
    We stress that our algorithm is \emph{asymptotically optimal}, since even the description of a tree ensemble has size $\Theta(TL)$, and the output size is $\Theta(n)$.
    
    On the technical level, the algorithm of~\citet{lundberg2018consistent, Lundberg2020} reduces computing $(\phi_i)_{i=1}^n$ to finding
    individual \emph{leaf contributions} to the attribution, one for each leaf/feature pair $(l,i)$ such
    that $i$ is used as a split feature in some ancestor of $l$.
    This goal is achieved using a top-down recursive algorithm
    whose running time is inherently $\Omega(TLD)$ (i.e., super-linear in the input size) simply because
    there can be $\Theta(TLD)$ such leaf/feature pairs.
    This bound still holds even when this approach is applied to computing a Banzhaf-value attribution.
    In our approach, leaf contributions are aggregated using
    a more efficient bottom-up dynamic programming approach,
    which requires only a \emph{linear} number of auxiliary values
    to be computed.
    
    In the experiments, our algorithm visibly outperforms all other algorithms, and can lead to considerable time savings when computing feature importances for decision tree-based models in practice. Moreover, we analytically prove that for trees of depths that commonly occur in practice, our algorithm
    for the Banzhaf value delivers numerically correct results. Similar arguments do not seem to be applicable to the most efficient algorithms computing Shapley value-based attributions, even for constant-depth trees.
    
    We also perform an experimental comparison of the Banzhaf and Shapley values for tree ensemble models.
        For four studied real-world datasets and using two different approaches to training tree models, we verify experimentally that both methods deliver essentially the same average feature importance scores (called \emph{global impacts} in~\citep{Lundberg2020}) and very close attributions of individual predictions despite the differences in the sets of axioms the Banzhaf and the Shapley values satisfy.
        However, the Banzhaf value is more numerically robust than the Shapley value, and only very small errors are observed in the computations.
Overall, our analysis indicates that for tree models, the Banzhaf value has two important advantages over the Shapley value.
While both methods deliver comparable attributions, the Banzhaf value works faster and is less prone to numerical errors.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries}\label{s:prelims}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Let $U:=\{1,\ldots,n\}$ be a set of \emph{features}.
Let $x$ be the input to the model
to be explained.
For $i\in U$, we write $x_i$ to refer to the \emph{value}
of the $i$-th feature in $x$.
More generally, for any subset $S\subseteq U$ we write
$x_S$ when referring to the vector $(x_i)_{i\in S}$.
We sometimes talk about random feature vectors, or consider
the values of individual features to be random variables.
We then write $X$ or $X_i$ respectively.
We write $X_S$ to denote the vector of random variables $(X_i)_{i\in S}$.
Let~$\bar{S}$ denote the complement $U\setminus S$.

\paragraph{Tree models.} Let $f:\mathbb{R}^U\to \mathbb{R}$ be the output function of the model to be explained. We focus on tree ensemble models $(\tr_i)_{i=1}^T$ where the output $f(x)$ of the model
is simply the average output $f_{\tr_i}(x)$ of its~$T$ individual trees.
Following \citet{Lundberg2020}, we assume the individual trees to have the number of leaves bounded by $L$ and depth bounded by $D$.\footnote{This is merely for clarity of the obtained time bounds. See discussion after Theorem~\ref{t:banzhaf}.}
Let us denote by $\rho_i$ the root of the tree $\tr_i$.

When talking about an input decision tree $\tr$, we adopt the notation of~\citep{Lundberg2020}.
$\tr$ is a binary tree based on single-variable splits:
each non-leaf node $v\in \tr$ is assigned a \emph{feature} $d_v$, and
a \emph{threshold}~$t_v$, whereas each leaf~$l$ is assigned a real \emph{value}~$f(l)$.
Let $a_v,b_v$ denote the left and right children of a non-leaf
node $v\in \tr$. The output $f_\tr(x)$~of the tree $\tr$ is computed by following a root-leaf
path in $\tr$: at a non-leaf node $v\in \tr$, we descend to $a_v$ if
$x_{d_v}< t_v$, or to $b_v$ otherwise. When a leaf is reached,
its value is returned. Denote by $\lvs(\tr)$ the set of leaves of $\tr$.
Denote by $\tr[v]$ the subtree of $\tr$ rooted at~$v$.


\paragraph{Set functions.}
We write $f(x_S,X_{\bar{S}})$ when referring to a random
variable being the value of $f$ if the
values for features in $S$ are fixed to the respective
values of $x$, and the values $X_{\bar{S}}$ are random
variables.
Let $X_U$ be distributed\footnote{In fact, here we can use any other distribution, possibly over some different validation data, such that the expectations $\EX[f(x_S,X_{\bar{S}})]$ can be estimated using Algorithm~\ref{alg:est}. This allows us to produce attributions that are contrastive to other baselines than the mean prediction over the training data.}  as in the training set of the model~$f$.
%\todo[inline]{Dodałem dwa footnoty dla reviewera 2.}
Recall that a \emph{set function} $g:2^{U}\to \mathbb{R}$
with $g(\emptyset)=0$ has to be fixed to talk
about the Shapley or Banzhaf value-based attributions $(\phi_i)_{i\in U}$ and $(\beta_i)_{i\in U}$ as defined in Equations~\eqref{eq:shap}~and~\eqref{eq:ban}, respectively.
\citet{Lundberg2020} and \citet{JanzingMB20} suggest
using the following
idealized\footnote{It might seem that using marginal expectation instead of conditional expectation here leads to inclusion of unrealistic data when features are highly dependent. However,~\citet{JanzingMB20} gave some compelling reasons why this is still a reasonable choice.}   set function $g^*$ for feature attribution:
\begin{equation}\label{eq:g-ideal-def}
  g^*(S):=\EX[f(x_S,X_{\bar{S}})]-\EX[f(X_U)].
\end{equation}
Note that the term $\EX[f(X_U)]$ in~(\ref{eq:g-ideal-def}) serves the purpose of ensuring $g^*(\emptyset)=0$ and
cancels out when computing the Shapley value from Equation~\eqref{eq:shap} or the Banzhaf value from Equation~\eqref{eq:ban}, since both depend on $g^*$ only through differences.
Thus, for simplicity, in the following we can redefine
${g^*(S):=\EX[f(x_S,X_{\bar{S}})]}.$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{algorithm}[t]
\caption{Estimating $\EX[f(x_S,X_{\bar{S}})]$}\label{alg:est}
\begin{algorithmic}[1]
\Function{$\textsc{Desc}$}{$S, v$}
    \If{$v$ is a leaf}
    \State \textbf{return} $f(v)$
    \EndIf
    \If{$d_v\in S$}

    \If{$x_{d_v}< t_v$}
      \State \textbf{return} $\textsc{Desc}(S,a_v)$
    \Else
      \State \textbf{return} $\textsc{Desc}(S,b_v)$
    \EndIf
    \Else
      \State \textbf{return} $\frac{r_{a_v}}{r_v}\cdot \textsc{Desc}(S,a_v)+\frac{r_{b_v}}{r_v}\cdot \textsc{Desc}(S,b_v)$
    \EndIf
\EndFunction
\Function{$g$}{$S$}
    \State \textbf{return} $\frac{1}{T}\cdot \sum_{i=1}^T \textsc{Desc}(S,\rho_i)$
\EndFunction
\end{algorithmic}
\end{algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Using the idealized set function $g^*$ would be computationally
too costly.
Consequently, \citet{Lundberg2020}, in their
\texttt{TREESHAP\_PATH}\footnote{We will sometimes use an abbreviated name \texttt{TREESHAP}.} algorithm, consider
the set function $g$ whose purpose is to ``approximate'' $g^*$.
Namely, $g(S)\approx g^*(S)$ is computed as shown in Algorithm~\ref{alg:est}.
This method dates back to the classical work of \citet{friedman2001}
and is also implemented as a way to compute partial dependence plots
in the scikit-learn package~\citep{scikit-learn}.
One advantage of this method is that it does not require access
to the training data, but merely to the ``coverages'' $r_v$ of all the subtrees $\tr[v]$ (for all trees $\tr$ in the ensemble),
i.e., the numbers of training set points that fall into $\tr[v]$.
It can be proved that this method approximates $\EX[f(x_S,X_{\bar{S}})]$ well
if the individual feature random variables $X_i$ are independent.
With such a set function~$g$, \citet{lundberg2018consistent, Lundberg2020} show how to compute the
Shapley value attributions $(\phi_i)_{i\in U}$ exactly
in $O(TLD^2+n)$ time.
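
As an illustration of Algorithm~\ref{alg:est} (a toy tree with made-up values, not taken from the experiments), let $T=1$ and $n=2$. Suppose the root $\rho$ splits on feature $1$, its left child $a_\rho$ is a leaf with $f(a_\rho)=1$, and its right child $v=b_\rho$ splits on feature $2$ and has two leaf children with $f(a_v)=3$ and $f(b_v)=5$. Let the coverages be $r_\rho=4$, $r_{a_\rho}=r_v=2$, $r_{a_v}=r_{b_v}=1$, and let the input satisfy $x_1\geq t_\rho$ and $x_2<t_v$. Then
\begin{align*}
g(\emptyset)&=\tfrac{2}{4}\cdot 1+\tfrac{2}{4}\left(\tfrac{1}{2}\cdot 3+\tfrac{1}{2}\cdot 5\right)=\tfrac{5}{2},\\
g(\{1\})&=\tfrac{1}{2}\cdot 3+\tfrac{1}{2}\cdot 5=4,\\
g(\{2\})&=\tfrac{2}{4}\cdot 1+\tfrac{2}{4}\cdot 3=2,\\
g(\{1,2\})&=3=f(x).
\end{align*}
Plugging these into Equation~\eqref{eq:ban} gives, e.g., $\beta_1=\tfrac{1}{2}\bigl((4-\tfrac{5}{2})+(3-2)\bigr)=\tfrac{5}{4}$.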


\iffalse
On the other hand, in the \texttt{TREESHAP\_INT} algorithm, Lundberg et al.~\shortcite{Lundberg2020}
estimate $g^*(S)$ by sampling some number $R$ of random points $x'$ of the
training data and computing the average value of $f(x_S,x'_{\bar{S}})$
over these samples.\footnote{In~\cite{SundararajanN20} this method is called \emph{Random Baseline Shapley}.}
Note that if the entire data was sampled, this would compute the
desired expectation exactly. The computation
cost would be then unacceptable, though.
The \texttt{TREESHAP\_INT} algorithm computes the Shapley value
$\phi_i$ exactly (for the described approximation of $g^*$) in $O(TRL+n)$ time.
We stress that this method requires access to the training data.
\fi

In the remaining part of the paper, we denote by $g(S)$ the
output of Algorithm~\ref{alg:est} for the subset $S\subseteq U$,
i.e., we consider the same approximation of $g^*(S)$ as
in the \texttt{TREESHAP\_PATH} algorithm of~\citet{Lundberg2020}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Banzhaf Value Algorithm}\label{s:algo}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


In this section, we introduce an optimal ${O(TL+n)}$ time algorithm, called \ban{},
for computing attributions based on the Banzhaf value.
For clarity, let us assume first that
there is just a single tree $\tr$ in the model, i.e., $T=1$.
This is without loss of generality, since the prediction of an ensemble model is simply the average of the predictions
produced by individual trees.
We describe the algorithm for arbitrary $T$ later on.
Due to space constraints, the proofs of technical lemmas
can be found in
\ifsupp
Appendix~\ref{a:omitted}.
\else
the supplementary material.
\fi

Let~$\rho$ denote the root of~$\tr$, and~$p_v$ the parent of node $v\in \tr$, $v\neq \rho$. Furthermore, let $F_v$ be the set of features assigned to the ancestors
of $v$, i.e., ${F_\rho=\emptyset}$, and $F_v=F_{p_v}\cup \{d_{p_v}\}$ for $v\neq \rho$. 
The value $P[v]=r_v/r_\rho$ can be thought of as the probability that the model
returns a value from $\tr[v]$.

Algorithm~\ref{alg:est} computes an estimate of $\EX[f(x_S,X_{\bar{S}})]$.
Observe that the output of this algorithm for $S=\emptyset$
is precisely equal to ${\sum_{l\in\lvs(\tr)} P[l]\cdot f(l)}$.
More generally, denote by
$P[v,S]$ the weight from the ancestor recursive calls
assigned to the subtree rooted at $v$
when running Algorithm~\ref{alg:est} with an arbitrary $S\subseteq U$.
Formally, $P[\rho,S]=1$, and for any $v\neq \rho$,
\begin{equation*}
  P[v,S]=\begin{cases}
      P[p_v,S]\cdot \frac{r_v}{r_{p_v}} & \text{ if } d_{p_v}\notin S,\\
      P[p_v,S]\cdot [x_{d_{p_v}}< t_{p_v}] & \text{ if }  d_{p_v}\in S, v=a_{p_v},\\
      P[p_v,S]\cdot [x_{d_{p_v}}\geq t_{p_v}] & \text{ if }  d_{p_v}\in S, v=b_{p_v}.\\
  \end{cases}
\end{equation*}

Then, the algorithm outputs
\begin{equation}\label{eq:tpd}
  \sum_{l\in\lvs(\tr)} P[l,S]\cdot f(l)=g(S)\approx g^*(S).%=\EX[f(x_S,X_{\bar{S}})].
\end{equation}

In our approach, each of the desired attributions $\beta_i$ is obtained
by summing the contributions of each individual leaf $l\in\lvs(\tr)$
%with $i\in F_l$
to the sum~(\ref{eq:ban}) with $g$ defined as in~(\ref{eq:tpd}). More precisely:
\begin{equation*}
  \beta_i=\sum_{l\in\lvs(\tr)}\left(\frac{f(l)}{2^{n-1}}\sum_{S\subseteq U\setminus\{i\}}\left(P[l,S\cup \{i\}]-P[l,S]\right)\right).
\end{equation*}

We now introduce the following crucial intermediate values
that will enable us to evaluate the above formula efficiently.
For any $v\in \tr$, and subset $G\subseteq U$, let
\begin{equation}\label{e:ban-p}
  \beta(v,G):=\frac{1}{2^{|G|}}\sum_{S\subseteq G} P[v,S].
\end{equation}
%We will later prove the following lemma.

\begin{restatable}{lemma}{lreduce}\label{l:reduce}
For any $i\in U$, we have:
\begin{equation*}
\beta_i=\sum_{\substack{l\in\lvs(\tr)\\i\in F_l}}2f(l)\cdot\left(\beta(l,F_l)-\beta(l,F_l\setminus\{i\})\right).
\end{equation*}
\end{restatable}
Lemma~\ref{l:reduce} reduces computing the Banzhaf value to
computing $O(L)$ values of the form $\beta(l,F_l)$, and $O(L\cdot D)$ values
of the form $\beta(l,F_l\setminus \{i\})$, for all $(l,i)$ such that $l\in \lvs(\tr)$
and $i\in F_l$. The $O(L\cdot D)$ bound follows since each leaf has no
more than $D$ ancestors, which implies $|F_l|\leq D$.
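
As a sanity check of Lemma~\ref{l:reduce} on a toy instance (with made-up numbers), let $n=1$ and let $\tr$ consist of a root $\rho$ with $d_\rho=1$ and two leaves $a_\rho,b_\rho$ with $f(a_\rho)=0$, $f(b_\rho)=8$, $r_{a_\rho}=3$, $r_{b_\rho}=1$, and suppose that $x_1\geq t_\rho$. Then $P[a_\rho,\emptyset]=\tfrac{3}{4}$, $P[a_\rho,\{1\}]=0$, $P[b_\rho,\emptyset]=\tfrac{1}{4}$, $P[b_\rho,\{1\}]=1$, and by~\eqref{e:ban-p},
\begin{align*}
\beta(a_\rho,\emptyset)&=\tfrac{3}{4}, & \beta(a_\rho,\{1\})&=\tfrac{3}{8},\\
\beta(b_\rho,\emptyset)&=\tfrac{1}{4}, & \beta(b_\rho,\{1\})&=\tfrac{5}{8}.
\end{align*}
Lemma~\ref{l:reduce} thus yields $\beta_1=2\cdot 0\cdot\left(\tfrac{3}{8}-\tfrac{3}{4}\right)+2\cdot 8\cdot\left(\tfrac{5}{8}-\tfrac{1}{4}\right)=6$, in agreement with the direct evaluation $\beta_1=g(\{1\})-g(\emptyset)=8-2=6$ of Equation~\eqref{eq:ban}.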

In the following part of the section, we first give a
recursive formula for computing the values $\beta(v,G)$ efficiently
using dynamic programming.
%This will also yield a proof of Lemma~\ref{l:reduce}.
Next, we show a simpler $O(LD)$ time algorithm computing all
the values $\beta(\cdot,\cdot)$ required by Lemma~\ref{l:reduce}.
As a final step,
we show how to improve the worst-case running time of the algorithm to optimal $O(L)$.

\paragraph{Recurrence.}
To proceed, we will need the auxiliary values $\Delta_{v,y}$ for $v\in\tr$ and $y\in U$,
defined inductively as follows:
\begin{equation}\label{eq:delta}\Delta_{v,y}=\begin{cases}
  1 & \text{if }v=\rho,\\
  \Delta_{p_v,y} & \text{if } d_{p_v}\neq y\text{ and }v\neq\rho,\\
  \Delta_{p_v,y}\cdot [x_y<t_{p_v}]\cdot \frac{r_{p_v}}{r_v} & \parbox{.15\textwidth}{if $d_{p_v}=y$ and $a_{p_v}=v\neq\rho$,}\\
  \Delta_{p_v,y}\cdot [x_y\geq t_{p_v}] \cdot \frac{r_{p_v}}{r_v} & \parbox{.15\textwidth}{if $d_{p_v}=y$ and $b_{p_v}=v\neq\rho$.}
\end{cases}
\end{equation}
The above auxiliary values can be in turn used to recursively
compute the values $P[\cdot,\cdot]$.
\begin{restatable}{lemma}{laddfeature}\label{l:add_feature}
Let $v\in \tr$ and $Q\subseteq U$ and $y\in U\setminus Q$. Then:
\begin{equation*}
  P[v,Q\cup\{y\}]=P[v,Q]\cdot \Delta_{v,y}.
\end{equation*}
\end{restatable}


Lemma~\ref{l:add_feature} applied to~\eqref{e:ban-p} allows computing the values $\beta(v,G)$ recursively,
as stated in the lemmas below.

\begin{restatable}{lemma}{ldpban}\label{l:dp-ban}
  Let $v\in \tr$ and $G\subseteq U$. Let $y\in U\setminus G$. Then:
  \begin{equation*}
    \beta(v,G\cup\{y\})=\frac{1}{2}\left(1+\Delta_{v,y}\right)\beta(v,G).
  \end{equation*}
\end{restatable}

\begin{restatable}{lemma}{lliftban}\label{l:lift-ban}
  Let $v\in \tr$, $v\neq\rho$. Then, for any $Q\subseteq U\setminus \{d_{p_v}\}$,
  \begin{equation*}
    \beta(v,Q)=\beta(p_v,Q)\cdot \frac{r_v}{r_{p_v}}.
  \end{equation*}
\end{restatable}
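
To illustrate how these lemmas interact (a minimal example, assuming that $v=b_\rho$ is a child of the root, $y:=d_\rho$, and $x_y\geq t_\rho$), note that $P[v,\emptyset]=r_v/r_\rho$ and $P[v,\{y\}]=1$, so Lemma~\ref{l:add_feature} forces $\Delta_{v,y}=r_\rho/r_v$. Starting from $\beta(\rho,\emptyset)=P[\rho,\emptyset]=1$, Lemma~\ref{l:lift-ban} gives $\beta(v,\emptyset)=r_v/r_\rho$, and Lemma~\ref{l:dp-ban} then gives
\begin{equation*}
\beta(v,\{y\})=\frac{1}{2}\left(1+\frac{r_\rho}{r_v}\right)\cdot\frac{r_v}{r_\rho}=\frac{1}{2}\left(\frac{r_v}{r_\rho}+1\right),
\end{equation*}
which matches $\tfrac{1}{2}\left(P[v,\emptyset]+P[v,\{y\}]\right)$ from~\eqref{e:ban-p}. Conversely, $\beta(v,\emptyset)=\frac{2}{1+\Delta_{v,y}}\cdot\beta(v,\{y\})$ recovers the value for the smaller feature set; since $\Delta_{v,y}\geq 0$, this inverse step is always well defined.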

\begin{algorithm}[t]
  \caption{Computing $\beta[l]=\beta(l,F_l)$ for all $l\in \lvs(\tr)$.}
\label{alg:traverse}

\begin{algorithmic}[1]
\Procedure{$\textsc{Traverse}$}{$v$}
\If {$d_{p_v}\in F$}
\State $\textrm{present}:=\textbf{true}$\Comment{record that $d_{p_v}\in F_{p_v}$}
\State $b:=\frac{2}{1+\delta[d_{p_v}]}\cdot \beta[p_v]$\Comment{ $b=\beta(p_v,F_{p_v}\setminus \{d_{p_v}\})$}
\Else
\State $\textrm{present}:=\textbf{false}$
\State $F:=F\cup \{d_{p_v}\}$\Comment{ensure $F=F_v$}
\State $b:=\beta[p_v]$\Comment{ $b=\beta(p_v,F_{p_v}\setminus \{d_{p_v}\})$}
\EndIf
\State $\delta_{\text{old}}:=\delta[d_{p_v}]$
\If{$v=a_{p_v}$}
  \State $\delta[d_{p_v}]:=\delta[d_{p_v}]\cdot [x_{d_{p_v}}<t_{p_v}]\cdot \frac{r_{p_v}}{r_v}$
\Else
\State $\delta[d_{p_v}]:=\delta[d_{p_v}]\cdot [x_{d_{p_v}}\geq t_{p_v}]\cdot \frac{r_{p_v}}{r_v}$
\EndIf
\State $\delta^*[v]:=\delta[d_{p_v}]$\Comment{store $\Delta_{v,d_{p_v}}$ for future use}
\State $b:=b\cdot r_v/r_{p_v}$\Comment{$b=\beta(v,F_{v}\setminus\{d_{p_v}\})$}
\State $\beta[v]:=b\cdot \frac{1}{2}(1+\delta[d_{p_v}])$ \Comment{Lemma~\ref{l:dp-ban}}
\If {$v\notin \lvs(\tr)$}
  \State $\textsc{Traverse}(a_v)$
  \State $\textsc{Traverse}(b_v)$
\EndIf
\If{$\textrm{present}=\textbf{false}$}\Comment{revert changes to $F,\delta$}
\State $F:=F\setminus \{d_{p_v}\}$
\EndIf
\State $\delta[d_{p_v}]:=\delta_{\text{old}}$
\EndProcedure
\end{algorithmic}

\end{algorithm}
\subsection{Basic algorithm}\label{s:algo-basic}

Equipped with Lemmas~\ref{l:dp-ban} and~\ref{l:lift-ban}, one can easily move between ``nearby'' values
$\beta(v,G)$.
Namely, for any ${i\in U}$, given $\beta(v,G)$ and $\Delta_{v,i}$,
each of the values
$\beta(a_v,G)$, $\beta(b_v,G)$, $\beta(v,G\cup\{i\})$
can be computed in $O(1)$ time.

Moreover, the values $\beta(p_v,G)$, $\beta(v,G\setminus\{i\})$ can also be obtained in $O(1)$ time
by applying
the respective ``inverse'' forms of these lemmas.
We stress that being able to compute $\beta(v,G\setminus\{i\})$
out of a value of the form $\beta(v,G)$,
i.e., removing elements from the feature set $G$, is crucial
for two reasons.
First, recall that we need to obtain values of the form $\beta(l,F_l\setminus\{i\})$ for all leaves $l$
and all $i\in F_l$. For all such~$i$, this value
can be obtained using a single inverse application of Lemma~\ref{l:dp-ban}.
Second, applying Lemma~\ref{l:lift-ban} to obtain
$\beta(v,F_v)$ out of the parent value $\beta(p_v,F_{p_v})$ requires
$d_{p_v}\notin F_{p_v}$.
This requirement is violated if $F_v=F_{p_v}$, i.e., if $d_{p_v}$ is also the split feature of some other ancestor
of $v$ in the tree (which does happen in practical models).
In such a case, the inverse Lemma~\ref{l:dp-ban} can be used
to first compute $\beta(p_v,F_v\setminus \{d_{p_v}\})$, then we apply Lemma~\ref{l:lift-ban}
to obtain $\beta(v,F_v\setminus \{d_{p_v}\})$,
and finally we again use Lemma~\ref{l:dp-ban} to get $\beta(v,F_v)$. 
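
For instance (a schematic case with hypothetical nodes), suppose that the root $\rho$ and its child $u=b_\rho$ both split on the same feature $y$, let $v=b_u$, and let $x_y\geq t_\rho$ and $x_y\geq t_u$. Then $F_u=F_v=\{y\}$, $\beta(u,\{y\})=\tfrac{1}{2}\left(\tfrac{r_u}{r_\rho}+1\right)$ by~\eqref{e:ban-p}, and Lemma~\ref{l:add_feature} gives $\Delta_{u,y}=P[u,\{y\}]/P[u,\emptyset]=r_\rho/r_u$ and, likewise, $\Delta_{v,y}=r_\rho/r_v$. The three steps described above read
\begin{align*}
\beta(u,\emptyset)&=\tfrac{2}{1+\Delta_{u,y}}\cdot\beta(u,\{y\})=\tfrac{r_u}{r_\rho},\\
\beta(v,\emptyset)&=\beta(u,\emptyset)\cdot\tfrac{r_v}{r_u}=\tfrac{r_v}{r_\rho},\\
\beta(v,\{y\})&=\tfrac{1}{2}\left(1+\Delta_{v,y}\right)\beta(v,\emptyset)=\tfrac{1}{2}\left(\tfrac{r_v}{r_\rho}+1\right),
\end{align*}
in agreement with $\beta(v,\{y\})=\tfrac{1}{2}\left(P[v,\emptyset]+P[v,\{y\}]\right)$.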

The basic algorithm (which is similar in its essence to \texttt{TREESHAP\_PATH})
computes all the values $\beta(v,F_v)$ for $v\in \tr$ -- as explained above -- using a
simple recursive tree traversal in $O(L)$ time.
In particular, this also gives all the values $\beta(l,F_l)$ that we need
when invoking Lemma~\ref{l:reduce}.
Afterwards, for each leaf $l\in \lvs(\tr)$, the remaining
$|F_l|$ values of the form
$\beta(l,F_l\setminus\{i\})$ for $i\in F_l$ (again, required by the formula
in Lemma~\ref{l:reduce}) can be computed in $O(1)$ extra time
each using the inverse form of Lemma~\ref{l:dp-ban}.
As a result, through all pairs $(l,i)$, this takes
 $ O\left(\sum_{l\in \lvs(\tr)}|F_l|\right)=O(LD)$
time.

The above analysis tacitly assumed that all the needed auxiliary values $\Delta_{v,y}$ can
be accessed in $O(1)$ time.
We now justify this assumption.
During the tree traversal we store a global array $\delta$
indexed with the features $U$.
We maintain an invariant that $\delta[y]$ equals $\Delta_{p_v,y}$
when the processing of a vertex $v$ starts and also when it finishes.
By~\eqref{eq:delta}, to guarantee that the invariant is satisfied upon the recursive traversals of the
subtrees rooted at $a_v$ or $b_v$, we need to update at most
the single value $\delta[d_v]$, because
$\Delta_{v,y}\neq \Delta_{a_v,y}$ or $\Delta_{v,y}\neq \Delta_{b_v,y}$ may only happen when $y=d_v$.
When a recursive traversal returns, we revert that change to %the cell
$\delta[d_v]$.

The pseudocode of a recursive procedure $\textsc{Traverse}$ computing all the values $\beta(l,F_l)$, which
we also require in our optimal algorithm, is
given as Algorithm~\ref{alg:traverse}. In this procedure, each of the computed values $\beta(v,F_v)$
is recorded in a global array as $\beta[v]$.
The auxiliary global variable $F$ stores the set $F_v$ when
node $v$ is processed; $F$ can be implemented using a bitmap of size $n$.
\subsection{The optimal algorithm}\label{s:algo-opt}
The high-level idea behind our improved algorithm is to avoid
computing all the leaf contributions to the individual
components $\beta_i$ of the Banzhaf value separately.
Instead, for every node $v\in \tr$, $v\neq \rho$, such that $d_{p_v}=i$, we compute at once
the total contribution to
$\beta_i$ of \emph{all} the leaves in $\lvs_v\subseteq  \tr[v]$, where $\lvs_v$ is defined
to be the subset of leaves for which~$v$
constitutes the \emph{nearest} weak ancestor $w$ (every node is considered a weak ancestor of itself) with $d_{p_w}=i$.

Note that for a given $i\in U$, the sets $\lvs_v$ for $v\in \tr$ satisfying
${d_{p_v}=i}$ are pairwise disjoint, and in fact form a partition of the set
$\{l\in \lvs(\tr):i\in F_l\}$
over which the summation in Lemma~\ref{l:reduce} is performed.
Additionally, observe that the values
$\Delta_{l,d_{p_v}}$ are equal to $\Delta_{v,d_{p_v}}$ for all leaves $l$ in $\lvs_v$.

Consider the following values for all $v\in \tr$, $v\neq\rho$:
\begin{align*}
B(v)=\sum_{l\in\lvs_v}f(l)\cdot \beta(l,F_l).
\end{align*}
The lemma below shows that computing the Banzhaf value $\beta$
can be reduced, in linear time, to computing all the
values $B(v)$, $v\in \tr\setminus\{\rho\}$: indeed, each $B(v)$ appears in the sum below for precisely one $i\in U$.

\begin{restatable}{lemma}{lbcontribs}\label{l:b-contribs}
For any $i\in U$, we have:
\begin{equation*}
    \beta_i=\sum_{\substack{v\in\tr\setminus \{\rho\}\\d_{p_v}=i}}\frac{2(\Delta_{v,i}-1)}{1+\Delta_{v,i}}\cdot B(v).
\end{equation*}
\end{restatable}


We have previously shown that the values $\beta(l,F_l)$ can be computed
in linear time.
We now describe a recursive procedure $\textsc{Fast}(u)$, where $u\neq \rho$, computing $B(v)$ for all $v\in\tr[u]$ in a bottom-up manner.
Let 
\begin{equation*}
    S(v)=\sum_{l\in \lvs(\tr[v])} f(l)\cdot \beta(l,F_l),
\end{equation*}
that is, $S(v)$ sums the values $f(l)\cdot \beta(l,F_l)$ in $\tr[v]$. Clearly, for each $l\in\lvs(\tr)$, we have $S(l)=f(l)\cdot\beta(l,F_l)$,
and for a non-leaf $v\in\tr$, $S(v)=S(a_v)+S(b_v)$ holds.
As a result, all the values $S(v)$ can be computed in linear time
using a bottom-up computation over the tree.
    \begin{algorithm}[tb]
  \caption{Computing the values $B(v)$ for all $v\in \tr$.}
\label{alg:fast}

\begin{algorithmic}[1]
\Procedure{$\textsc{Fast}$}{$v$}
\State $H[d_{p_v}].\textsc{Push}(v)$
\If {$v\in \lvs(\tr)$}
  \State $S[v]:=f(v)\cdot \beta[v]$
\Else
    \State $\textsc{Fast}(a_v)$
    \State $\textsc{Fast}(b_v)$
    \State $S[v]:=S[a_v]+S[b_v]$
\EndIf
\State $z:=0$\Comment{$z$ stores the sum $\sum_{w\in Q_v}S(w)$}
\While {$H[d_{p_v}].\textsc{Top}()\neq v$}
    \State $z:=z+S[H[d_{p_v}].\textsc{Top}()]$
    \State $H[d_{p_v}].\textsc{Pop}()$
\EndWhile
\State $B[v]:=S[v]-z$
\If{$|H[d_{p_v}]|=1$}\Comment{empty $H[d_{p_v}]$ if $g_v=\perp$}
    \State $H[d_{p_v}].\textsc{Pop}()$
\EndIf
\EndProcedure

\end{algorithmic}

\end{algorithm}

Given the sums $S(v)$, we proceed as follows.
For $v\in\tr$, let $Q_v$ be the set of nodes $w\in \tr[v]$, $w\neq v$, such that $d_{p_w}=d_{p_v}$ and~$v$ is the nearest proper ancestor $u$
of $w$ with $d_{p_u}=d_{p_w}$. We have:
  %\begin{equation*}
    $\lvs_v = \lvs(\tr[v])\setminus \left(\bigcup_{w\in Q_v} \lvs(\tr[w])\right)$,
  %\end{equation*}
  and thus
  \begin{equation*}
    B(v)=S(v)-\sum_{w\in Q_v}S(w).
  \end{equation*}
 Observe that the total size of the sets $Q_v$ (over all $v\in\tr$) is $O(L)$,
  so if we are able to iterate through $Q_v$ in $O(|Q_v|)$ time whenever we wish to compute
  $B(v)$, the computation of all the values $B(v)$
  takes $O(L)$ time as well.
  We now explain how to accomplish this.
  Let $g_w$ denote the nearest proper ancestor of $w\in \tr$
  with $d_{p_w}=d_{p_{g_w}}$ (we put $g_w=\perp$ if no such ancestor exists).
  One way to enable iterating through
  $Q_v$ when $v$ is processed bottom-up,
  is to maintain, for each feature $j\in U$,
  a global stack $H[j]$ containing
  all the nodes $w$ such that $d_{p_w}=j$
  and that the computation for $w$ (i.e., the call $\textsc{Fast}(w)$) has already been started or completed,
  but the computation for $g_w$
  has not yet completed.
  The stack elements are sorted using
  the pre-order of the nodes of $\tr$, so that
  the node $w$ with the highest pre-order
  is at the top of $H[d_{p_w}]$.
  The stack can be updated in $O(1)$ time
  whenever a recursive call starts.
  Observe that $v\in H[d_{p_v}]$ when $\textsc{Fast}(v)$
  has started but has not yet finished.
  Now, given $H[d_{p_v}]$, it is enough to note
  that $Q_v$ equals precisely the set of elements
  of $H[d_{p_v}]$ closer to the top of the stack than $v$.
  Thus, one can indeed iterate through $Q_v$
  in $O(|Q_v|)$ time as desired.
  Moreover, $Q_v$ constitutes precisely the
  set of elements that have to be popped from 
  $H[d_{p_v}]$ when $\textsc{Fast}(v)$ returns.
  The asymptotic cost of popping stack elements can be charged
  to the corresponding pushes and thus can be neglected.
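
  As a small illustration, consider a hypothetical tree (not taken from our experiments) in which the root $\rho$ tests feature $1$ and has children $v$ and a leaf $l_6$; the node $v$ tests feature $2$ and has children $x$ and a leaf $l_5$; the node $x$ tests feature $1$ again and has two non-leaf children $w_1,w_2$, both testing feature $3$; finally, $w_1$ has leaf children $l_1,l_2$ and $w_2$ has leaf children $l_3,l_4$. Writing $c_k$ as a shorthand (introduced only for this example) for $f(l_k)\cdot\beta(l_k,F_{l_k})$, we get
  \begin{align*}
    S(w_1)&=c_1+c_2, & S(w_2)&=c_3+c_4,\\
    S(x)&=c_1+c_2+c_3+c_4, & S(v)&=c_1+c_2+c_3+c_4+c_5.
  \end{align*}
  Here $d_{p_v}=d_{p_{w_1}}=d_{p_{w_2}}=1$ and $d_{p_x}=2$, so $Q_v=\{w_1,w_2\}$, $\lvs_v=\{l_5\}$, and
  \begin{equation*}
    B(v)=S(v)-S(w_1)-S(w_2)=c_5.
  \end{equation*}
  When $\textsc{Fast}(v)$ is about to compute $B(v)$, the stack $H[1]$ contains, from bottom to top, the nodes $v,w_1,w_2$; popping $w_2$ and $w_1$ yields exactly the subtracted sum.
  Since every node $w$ belongs to at most one set $Q_v$ (namely the one with $v=g_w$), we have $\sum_{v\in\tr}|Q_v|\le|\tr|$, which is where the $O(L)$ bound on the total size comes from.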
  
  The pseudocode of the procedure $\textsc{Fast}$ computing
  all the values $B(v)$ given the values $\beta(l,F_l)$ is
  given in Algorithm~\ref{alg:fast}. In Algorithm~\ref{alg:banzhaf} we give the pseudocode
  of the full algorithm computing the Banzhaf value-based
  attributions for a tree ensemble model
  $(\tr_j)_{j=1}^T$. Since the value of such a model
  is defined to be the average prediction over all the individual tree predictions, the final attribution
  is simply the average of the individual attributions.
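  This is a consequence of the linearity of the Banzhaf value in the underlying game. As a sketch, write $\nu_j$ for the game associated with the single tree $\tr_j$ and $\beta_i^{(j)}$ for the corresponding attribution of feature $i$ (both symbols are introduced here only for illustration), and assume the standard definition of the Banzhaf value as the average marginal contribution over all subsets. Then
  \begin{equation*}
    \beta_i=\frac{1}{2^{n-1}}\sum_{S\subseteq U\setminus\{i\}}\frac{1}{T}\sum_{j=1}^{T}\bigl(\nu_j(S\cup\{i\})-\nu_j(S)\bigr)=\frac{1}{T}\sum_{j=1}^{T}\beta_i^{(j)}.
  \end{equation*}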
  \begin{algorithm}[tb]
  \caption{Computing the attributions $(\beta_i)_{i=1}^n$ of the prediction $f(x)$ of the tree ensemble model $(\tr_j)_{j=1}^T$.}
\label{alg:banzhaf}
\begin{algorithmic}[1]
\Function{$\textsc{BanzhafAttribution}$}{$n,(\tr_j)_{j=1}^T$}
    \For{$i\in U$}\Comment{initialize global data}
        \State $\beta_i:=\beta[i]:=0$\Comment{$(\beta_i)_{i=1}^n$ stores the result} 
        \State $\delta[i]:=1$
        \State $H[i]:=\text{empty stack}$
    \EndFor
    \State $F:=\emptyset$
    \For{$j=1,\ldots,T$}
        \State $\rho:=\text{the root node of }\tr_j$
        \For{$v\in \{a_\rho,b_\rho\}$}
        \State $\textsc{Traverse}(v)$
        \State $\textsc{Fast}(v)$
        \EndFor        
        \For{$v\in \tr_j\setminus\{\rho\}$}
            \State $\beta_{d_{p_v}}:=\beta_{d_{p_v}}+\frac{2(\delta^*[v]-1)}{1+\delta^*[v]}\cdot B[v]$\Comment{Lemma~\ref{l:b-contribs}}
        \EndFor
    \EndFor
    \State $\textbf{return }(\beta_i/T)_{i=1}^n$\Comment{average through the $T$ trees}
\EndFunction
\end{algorithmic}

\end{algorithm}
  We have thus proved:% the following result:
  \begin{theorem}\label{t:banzhaf}
    Let $n=|U|$. The Banzhaf value-based attribution $(\beta_i)_{i\in U}$ of a prediction of a tree ensemble
    model consisting of $T$ trees with at most $L$ leaves each, can be computed in optimal $O(TL+n)$ time.
  \end{theorem}

We remark that if the ensemble contains $T$ trees of very different
sizes, the time can be more precisely bounded~by $O\left(\sum_{i=1}^T|\tr_i|+n\right)$,
i.e., remains optimal in the input size.

Finally, it is worth noting that the above approach to
speeding up the basic algorithm can also be successfully
applied to reduce the time complexity of the
\texttt{TREESHAP\_PATH} attribution algorithm of \citet{Lundberg2020}
from $O(TLD^2+n)$ to $O(TLD+n)$.
\ifsupp
Due to space constraints, we defer the details to
Appendix~\ref{a:shapley}.
\else
This is described in detail in the supplementary
material.
\fi


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experimental Analysis}\label{sec:experiment}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The goals of our experiments are threefold:
\begin{itemize}
    \item \textit{Time performance} --- first, we test the performance of the \ban{} algorithm proposed in the previous section and compare it to the performance of the \texttt{TREESHAP\_PATH}  algorithm by \citet{Lundberg2020}---the state-of-the-art algorithm for the Shapley value attributions for tree models.
    \item \textit{Qualitative differences} --- next, we investigate whether the Banzhaf value returns qualitatively different results than the Shapley value for tree models.
    \item \textit{Numerical accuracy} --- finally, we compare numerical accuracy of both algorithms.
\end{itemize}

\subsection{Experimental setup and datasets}
In our experiments we use both the sklearn implementation of Decision Trees (DT) and the xgboost implementation of Gradient Boosting Decision Trees (GBDT). These are some of the most popular algorithms for generating decision trees and are quite often used with large tree depths. Using large-depth trees is particularly beneficial for datasets with many features and complex relationships between them (see, e.g., \citep{Bordag2021.04.18.21254782,Pham2019MultimodalDO} for a usage of trees of depth 100). Let us emphasize that a large depth of a tree, e.g., depth 100, does not mean that the size of the tree is $2^{100}$, because trees might be (and usually are) unbalanced. To simplify the experiments and reduce their running times, we trained the DT algorithm to generate only one tree.
We use four ``real-world'' datasets (see Table~\ref{tab:datasets} for key details):

\begin{table}[t]
\begin{tabular}{@{\hspace{0em}}l@{\hspace{0.5em}}c@{\hspace{0.7em}}c@{\hspace{0.7em}}c@{\hspace{0.7em}}c@{\hspace{0.7em}}c@{\hspace{0.7em}}r@{\hspace{0em}}}
\hline
name &  rows & cols & \ tree & iter. & \ max & learning  \\
 &   &  & depth &  & depth &  \ \ \ rate \\
\hline
\bos{} & 506 & 13 & 10  & 100  & 6 & $0.01$  \\
\nh{} & 8023 & 79  & 40 & 250  & 4 & $0.2$ \\
\texttt{VEH.INS.} & 304887 & 14 & 60 & 250 & 4 & $0.2$  \\
\fl & 1543718 & 647  & 100 & 250 & 10 & $0.2$\\
\hline
\end{tabular}
\caption{The sizes of the datasets and the parametrization of the experiments. The ``tree depth'' column reports the tree\_depth of the decision tree (DT), with all the other parameters set to default values. The ``iter.'', ``max depth'' and ``learning rate'' columns are the parameters used for training xgboost.}\label{tab:datasets}
\end{table}

\begin{enumerate}
  \item \bos{} (abbr. \texttt{BS}).~\citep{bostondataset}. 
This small prediction dataset contains information concerning housing in the
area of Boston, Massachusetts. The task is to predict the price of a house. 
 \item \nh{} (\texttt{NH}). 
 The same dataset that was used in previous work on tree model interpretability \citep{Lundberg2020} which our work most closely relates to. The parameters used for training were the same as in \citep{Lundberg2020}.
 \item \hi{} (\texttt{VI}).~\citep{vehicledataset}. 
A medium size dataset for predicting who might be interested in vehicle insurance based on health insurance data.
\item \fl{} (\texttt{FL}).~\citep{flightsdataset}. 
A large dataset for predicting flight delays. The large number of columns results from one-hot encoding the columns \texttt{UniqueCarrier}, \texttt{Origin}, \texttt{Dest}, and
\texttt{CancellationCode} in the standard way, i.e., for each possible value $v$ of a given column $c$ we created an additional binary column $c\_v\in\{0,1\}$ such that the value of $c$ equals $v$ iff the value of $c\_v$ equals~$1$.
\end{enumerate}

We will refer to the above
datasets by adding ``DT'' and ``GB'' suffixes (for the DT and GBDT algorithms, resp.) to the name of the dataset. Note that the parameters were not extensively tuned, since our main goal here is interpreting
models rather than optimizing them.

All our experiments were performed on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 512 GB of RAM, using only one thread for computation. The operating system was Ubuntu 18.04.2 LTS.
Our linear-time $\ban{}$ algorithm was implemented in C++, whereas for $\texttt{TREESHAP\_PATH}$, we used the original C language implementation from the SHAP package~\citep{shappackage}.
The binaries were compiled using clang version 6.0.0-1ubuntu2 with -O3 optimization.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Comparison of running times}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{table}[t]
\centering
{\renewcommand{\arraystretch}{1}
\begin{tabular}{@{\hspace{0em}}l@{\hspace{0.3em}}c@{\hspace{0.2em}}c@{\hspace{0em}}l@{\hspace{0.2em}}l@{\hspace{0.2em}}c@{\hspace{0.3em}}c@{\hspace{0em}}}
\cline{1-3}\cline{5-7}
 & \small{\ban{}} & \small{\texttt{TREESHAP}} & & & \small{\ban{}} & \small{\texttt{TREESHAP}}\\
\cline{1-3}\cline{5-7}
\small{BS\_GB} & \small{0.48 s}    & \small{0.70 s}   &  & \small{BS\_DT} & \small{0.41 s}    & \small{0.41 s}     \\
\small{VI\_GB} & \small{23.63 s}   & \small{35.32 s}  &  & \small{NH\_DT} & \small{3.57 s}    & \small{42.87 s}    \\
\small{NH\_GB} & \small{50.20 s}   & \small{1 m 28 s} &  & \small{VI\_DT} & \small{4 m 55 s}  & \small{30 m 55 s}  \\
\small{FL\_GB} & \small{13 m 18 s} & \small{48 m 8 s} &  & \small{FL\_DT} & \small{14 m 28 s} & \small{5 h 9 m}    \\
\cline{1-3}\cline{5-7}
\end{tabular}}
 \caption{Running times of the two attribution algorithms on the entire dataset. We observe that \ban{} is faster than \texttt{TREESHAP\_PATH} on every instance except BS\_DT, where the running times coincide.}
  \label{tab:running_times}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this section, we compare the running times of the algorithms. For each
of the instances, the task was to compute the attributions
of \emph{all} individual data points. Table~\ref{tab:running_times} shows the running times for all the instances. We conclude that \ban{} is never slower than \texttt{TREESHAP\_PATH} (the running times coincide only on the smallest instance, BS\_DT), and using it can lead to considerable time savings for larger datasets.
As anticipated by the theoretical worst-case time complexity analysis, the observed speed-up generally increases with the
depth of trees in the model.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Comparison of feature scores}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We test whether the Banzhaf value assigns qualitatively different importance to features than the Shapley value. The comparison is performed
from two viewpoints.

\paragraph{Global importance.} First, we compare the global importances of individual features for the model.
To this end, we apply the same measure of \textit{global impact} of a feature as in \citep{Lundberg2020}. Let~$\mathcal{D}$ be some dataset.
Suppose for each $i\in U$ we have some feature attribution function $\gamma_i:\mathcal{D}\to \mathbb{R}$.
Let us consider the global impact of the feature over dataset $\mathcal{D}$
measured as
  $\Gamma_i=\sum_{x\in \mathcal{D}} |\gamma_i(x)|$.
For example, we can set $\gamma_i=\phi_i$ to get a \emph{Shapley global impact} $\Phi_i$,
or $\gamma_i=\beta_i$ to get a \emph{Banzhaf global impact} $B_i$.

For each of the datasets and algorithms we computed and plotted
the Shapley and Banzhaf global impacts. The obtained plots
can be found in
\ifsupp
Appendix~\ref{a:global-plots}.
\else
the supplementary material.
\fi

For the \nh{}, \bos{}, and \hi{} datasets, the obtained plots of Banzhaf/Shapley global impacts, computed using \ban{} and \texttt{TREESHAP\_PATH} respectively,
are virtually indistinguishable.
For the larger instance based on the dataset \fl, only very small differences in the ordering of features by importance can be observed for both \flgb{}
and \fldt{}.

\paragraph{Specific data points.} 
We now turn to describing how much the obtained Banzhaf and Shapley attributions
deviate from each other for specific data points.
To measure the difference between the feature orderings produced by both methods, we computed the \emph{modified Cayley distance} between
the respective orderings of $n\in \{3,10,20\}$ most important features
for each data point, and took the average
over all data points.
The Cayley distance measures the minimum number of swaps (transpositions of two elements) needed to transform one permutation into another. In our modified version, we also support the case where the sets of the considered most important features in the respective permutations are different: a feature missing from a permutation is added at its end.
The results are presented in Table~\ref{tab:cayley}.
They confirm that the differences are on average small; in particular for
the instances \bosgb{}, \nhgb{}, and \higb{}, for 98\% of the
data points, the respective 3 top features and their order matched.
The deviation between the orderings was generally larger for the DT instances,
where larger tree depths were allowed.
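
For intuition, consider a made-up data point whose three most important features are ordered $(a,b,c)$ by one method and $(b,a,c)$ by the other: a single swap of $a$ and $b$ suffices, so the distance is $1$, whereas $(a,b,c)$ versus $(c,a,b)$ requires two swaps. If the sets of top features differ, e.g., $(a,b,c)$ versus $(a,b,d)$, then, assuming that each missing feature is simply appended at the end, we compare $(a,b,c,d)$ with $(a,b,d,c)$ and again obtain distance $1$. In general, for two orderings $\pi,\sigma$ of the same $m$ items,
\begin{equation*}
  d_{\mathrm{Cayley}}(\pi,\sigma)=m-\#\mathrm{cycles}(\pi^{-1}\circ\sigma),
\end{equation*}
where fixed points count as cycles of length one.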

We also studied per-feature average differences between the values of Banzhaf and Shapley
attributions for each of the datasets.
We consider both MAD (Mean Absolute Difference) and RMSD (Root Mean Square Difference).
\ifsupp
See Appendix~\ref{a:errors} for the relevant plots.
\else
The relevant plots can be found in the supplementary material.
\fi
Formally, for each of the datasets $\mathcal{D}$ and each feature~$i$ used therein, these are defined as:
$\text{MAD}_i= \frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}|\phi_i(x) - \beta_i(x)|$ and
$\text{RMSD}_i = \sqrt{\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}(\phi_i(x) - \beta_i(x))^2}$.
Here, $\phi_i(x)$ denotes the Shapley attribution of $f(x)$ for data point $x\in\mathcal{D}$, as computed by \texttt{TREESHAP\_PATH}{}.
Similarly, $\beta_i(x)$ denotes the Banzhaf attribution
as computed by \ban{}.
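
As a toy illustration with made-up numbers, suppose that $|\mathcal{D}|=3$ and that the differences $\phi_i(x)-\beta_i(x)$ equal $0.1$, $-0.1$, and $0.4$. Then
\begin{align*}
  \text{MAD}_i&=\frac{0.1+0.1+0.4}{3}=0.2, &
  \text{RMSD}_i&=\sqrt{\frac{0.01+0.01+0.16}{3}}=\sqrt{0.06}\approx 0.245.
\end{align*}
In general, $\text{RMSD}_i\geq \text{MAD}_i$, with equality only if all the absolute differences coincide, so a large gap between the two measures indicates that a few data points carry most of the deviation.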

For the ``smaller'' instances \bosgb{}, \nhgb{}, and \higb{} and all features, the observed MAD and RMSD differences
did not exceed 5\% of the corresponding global impacts.
For the remaining larger models, the MAD difference did not exceed 20\% for the top features.
On the other hand, for the large-depth $\fldt{}$ model, the RMSD difference reached around 50\% even for the top features, which suggests that there were data points with very large absolute differences in the produced attributions.
These differences indicate that when looking at specific data points
one should expect only small differences in the ordering of features and only for features with similar scores.
The differences are expected to be larger for larger models.

The average error statistics also show an interesting phenomenon: for the studied datasets and models, the per-feature Banzhaf and Shapley attributions are very close to each other even though the Banzhaf value does not satisfy the \emph{Efficiency axiom} (in contrast to the Shapley value), and thus the produced feature scores do not typically sum up to the difference between the prediction and the ``baseline'' mean prediction $\EX[f(X_U)]$.
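
To see how efficiency can fail, consider a toy three-player majority game (unrelated to our datasets; the symbol $\nu$ is used only for this illustration) with $\nu(S)=1$ if $|S|\geq 2$ and $\nu(S)=0$ otherwise, and assume the standard definitions of both values. By symmetry and efficiency, the Shapley value of each player is $1/3$. For the Banzhaf value, the marginal contribution of a player $i$ equals $1$ for the two singleton sets $S\subseteq U\setminus\{i\}$ and $0$ for the empty and the two-element set, so
\begin{equation*}
  \beta_i=\frac{0+1+1+0}{4}=\frac{1}{2},\qquad \sum_{i\in U}\beta_i=\frac{3}{2}\neq 1=\nu(U)-\nu(\emptyset).
\end{equation*}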

\begin{table}[t]
\centering
{\renewcommand{\arraystretch}{1}
\begin{tabular}{@{\hspace{0em}}l@{\hspace{0.6em}}c@{\hspace{0.6em}}c@{\hspace{0.6em}}r@{\hspace{0.6em}}r@{\hspace{0.6em}}l@{\hspace{0.6em}}c@{\hspace{0.6em}}c@{\hspace{0.6em}}r@{\hspace{0.6em}}r@{\hspace{0em}}}
\cline{1-4}\cline{6-9}
 \small{Ins/n} & \small{3} & \small{10} & \small{20} & & \small{Ins/n} & \small{3} & \small{10} & \small{20}\\
\cline{1-4}\cline{6-9}
\small{BS\_GB} &  \small{0.02} &  \small{1.05} & & & \small{BS\_DT \ \ \ } \vspace{0em}&  \small{0.08} &  \small{1.7}  \\
\small{NH\_GB} & \small{0.01} & \small{0.34}  & \small{1.53} & & \small{NH\_DT} & \small{0.29} & \small{3.69} & \small{10.79}     \\

\small{VI\_GB} & \small{0.02} & \small{0.73} & & & \small{VI\_DT} & \small{0.13} & \small{2.60} & \\

\small{FL\_GB} & \small{0.4 } & \small{3.08} &  \small{8.63} & &  \small{FL\_DT} & \small{0.18} & \small{3.38}& \small{10.59}\\
\cline{1-4}\cline{6-9}
\end{tabular}}
\caption{The average modified Cayley distance for the $n$ most important features for $n\in \{3,10,20\}$ produced by \ban{} and \texttt{TREESHAP\_PATH} algorithms.\label{tab:cayley}
}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Numerical accuracy}\label{section:numerical}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The fact that the more significant differences between the obtained importances arose for large models suggested that the compared attribution algorithms might suffer from numerical problems.
To investigate this possibility and compare the numerical stability of $\ban{}$ and $\texttt{TREESHAP\_PATH}$, we considered a simple artificially prepared instance \syns{} for which we know the exact answer for both the Shapley value and the Banzhaf value.

In the \syns{} instance, the set of features is $U=\{1,\ldots,d\}$, where $d$ is a depth parameter. The instance contains one tree and one data point $x=[1,\dots,1]\in \mathbb{R}^d$. The tree consists of two subtrees of the same shape and depth $d-1$. All values $f(l)$ in the leaves are equal to $0$ and $777$ in the left and the right subtree of the root, resp. All leaves $l$ have coverages equal to $33$. Every internal node of depth $i$ has one leaf child, and one non-leaf child, whose (inductively defined) subtree has depth $i-1$. The split condition in an internal node at depth $i$ is $x_{d-i}< 1$. In this instance, the only feature with a nonzero Shapley/Banzhaf importance, equal to $388.5$, is the feature $d$ used to split at depth~$0$.
All other features have %Shapley/Banzhaf
importances equal to $0$.\footnote{This follows by the \emph{sensitivity} axiom (see, e.g.,~\citep{JanzingMB20}) that both Banzhaf and Shapley values satisfy.}
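
The value $388.5$ can be verified directly. As a sketch, let $\nu(S)$ denote the expected prediction when only the features in $S\subseteq U$ are known (the symbol $\nu$ is introduced only for this illustration), where a split on an unknown feature is resolved by averaging over both children weighted by their coverages. We also assume that the data point $x$ is routed to the right subtree of the root, which is consistent with the stated importance. Since the two subtrees of the root have the same shape and all leaves have equal coverage, each subtree receives weight $1/2$ when $d\notin S$. Hence, independently of the remaining features,
\begin{equation*}
  \nu(S)=
  \begin{cases}
    777 & \text{if } d\in S,\\
    \tfrac{1}{2}\cdot 0+\tfrac{1}{2}\cdot 777=388.5 & \text{if } d\notin S.
  \end{cases}
\end{equation*}
Every marginal contribution of the feature $d$ thus equals $777-388.5=388.5$, and every marginal contribution of any other feature equals $0$. Both the Shapley and the Banzhaf value are weighted averages of marginal contributions, so both assign $388.5$ to $d$ and $0$ to the other features.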

We have observed that for trees of depth $d\approx 50$, errors dominate the results, i.e., the relative error exceeds $1$.
In Figure~\ref{fig:numerical} we visualize the mean absolute errors
for $\texttt{TREESHAP\_PATH}$ and $\ban{}$ for the \syns{} instance.


We now give a potential reason why the Banzhaf value-based implementations may be
much more
stable in terms of the produced relative errors.
Recall that the values $\beta(l,F_l)$ and $\beta(l,F_l\setminus\{i\})$ for all
$l\in \lvs(\tr)$, $i\in F_l$ are computed via dynamic programming
using Lemmas~\ref{l:dp-ban}~and~\ref{l:lift-ban}.
Hence, they are all computed via
multiplications and divisions on \emph{positive} numbers
roughly between $0.5$ and~$r_\rho$.
In fact, the intermediate values $\beta(v,F_v)$ can be obtained
via $O(1)$ applications of Lemmas~\ref{l:dp-ban}~and~\ref{l:lift-ban} from the
``parent'' value $\beta(p_v,F_{p_v})$.
Such a computation can be proven to introduce
a multiplicative error between $1/(1+\epsilon)^{O(1)}$ and $(1+\epsilon)^{O(1)}$, where
$\epsilon$ is the machine epsilon.
This in turn implies a relative error bound of $(1+\epsilon)^{O(1)}-1$.
Moreover, by induction on the tree depth, we can easily obtain (see
\ifsupp
Appendix~\ref{s:numeric-bound}
\else
the supplementary material
\fi
for a proof):
\begin{lemma}\label{l:numeric-bound}
  The leaf values $\beta(l,F_l)$ can be computed with
  relative error at most $(1+\epsilon)^{O(D)}-1$.
\end{lemma}
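
For a rough sense of scale, assume double precision with $\epsilon=2^{-52}\approx 2.2\cdot 10^{-16}$ and suppose that the constant hidden in the $O(D)$ term is $10$ (an arbitrary value chosen only for illustration). For depth $D=100$ we then get
\begin{equation*}
  (1+\epsilon)^{10D}-1=(1+\epsilon)^{1000}-1\approx 1000\,\epsilon\approx 2.2\cdot 10^{-13},
\end{equation*}
where we used $(1+\epsilon)^k-1\approx k\epsilon$, valid whenever $k\epsilon\ll 1$.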

\begin{figure}[t]
 \centering
\includegraphics[height=4.9cm]{img/shapley_numerical_error_14_l2_crop.png}
  \caption{The numerical error for \texttt{SYNTHETIC\_SPARSE}.}
\label{fig:numerical}
\end{figure}

This bound is quite pessimistic and at the same time not very large if double precision is used and
the tree depth $D$ is small enough.
On the other hand, when computing the Shapley value
attributions, if one wants to retain the $O(LD^2)$ time bound of the \texttt{TREESHAP\_PATH} algorithm~\citep{Lundberg2020}, then 
it seems that \emph{subtractions} of intermediate values
are inherent.
Roughly speaking, this is because for Shapley-based attributions, if one applies an analogous
dynamic programming approach, then the Shapley-analogue of Lemma~\ref{l:dp-ban} involves a recursive formula that is a \emph{sum} of two ``earlier'' dynamic programming cells.\ifsupp
\footnote{See Lemma~\ref{l:dp} in Appendix~\ref{a:shapley}.}\else
\footnote{See the supplementary material for details.}
\fi
Recall, however, that our (and also Lundberg et al.'s) approach also required \emph{inverse} applications of Lemma~\ref{l:dp-ban}, especially when a single feature may appear multiple times on a root-leaf path.
For the Shapley value,
such an inverse application involves subtraction of equally-signed numbers.\footnote{In the original \texttt{TREESHAP} algorithm
subtractions of this kind manifest in line~31 of~\citep[Algorithm~2]{abs-1905-04610}.}

It is unclear whether a relative error bound similar to that of Lemma~\ref{l:numeric-bound} can be
proven in the presence of such subtractions, which in general may
lead to so-called \emph{catastrophic cancellations}.
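
As a generic illustration (not specific to either algorithm), suppose that two positive quantities $a=1$ and $b=1-10^{-12}$ are each known only up to a relative error of $\delta=10^{-16}$. Their difference $a-b=10^{-12}$ then carries an absolute error of up to $\delta(|a|+|b|)\approx 2\cdot 10^{-16}$, i.e., a relative error of up to
\begin{equation*}
  \frac{\delta(|a|+|b|)}{|a-b|}\approx\frac{2\cdot 10^{-16}}{10^{-12}}=2\cdot 10^{-4},
\end{equation*}
which is about twelve orders of magnitude larger than $\delta$.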
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The contribution of this paper is twofold. First, we have developed an efficient algorithm for computing feature importance measures for tree ensemble
models that is based on the Banzhaf value. This result improves upon the running time of the previous state of the art.
Second, we have presented the first extensive comparison between the Shapley and Banzhaf values in this context. We observe that both methods deliver attributions of essentially the same strength by returning
almost the same ordering of features. However, our experimental results indicate that the Banzhaf value has an important advantage over the Shapley value:
it allows for faster algorithms, and these algorithms incur much smaller numerical errors.

We stress that this work identifies some computational/practical advantages of using the Banzhaf value compared to the Shapley value for feature attribution in tree ensemble models (in particular, compared to the algorithm by~\citet{Lundberg2020} that is commonly used by practitioners). It would also be very interesting to compare the Shapley-based and Banzhaf-based attributions qualitatively. We believe that such a comparison requires a much more exhaustive study and is beyond the scope of this paper. However, it is, in our opinion, a very compelling direction for future research.

\begin{acknowledgements}
This work has been partially supported by the ERC CoG grant TUgbOAT no 772346 and NCN project no 2020/37/B/ST6/04179.

We thank the anonymous reviewers for useful comments.
\end{acknowledgements}

\bibliography{karczmarz_300}

\ifsupp
\appendix
\onecolumn


\section{Global Impacts Comparison}\label{a:global-plots}

\begin{figure}[!ht]
  \centering
 \begin{subfigure}[t]{.49\textwidth}
 \centering
 \includegraphics[width=.8\linewidth]{img/nhanes_GBDT_shap_orig_c.png}
   \caption{Global Shapley impact obtained with \texttt{TREESHAP\_PATH}.}
 \end{subfigure}
 \begin{subfigure}[t]{.49\textwidth}
 \centering
 \includegraphics[width=.8\linewidth]{img/nhanes_GBDT_banzhaf_fast.png}
   \caption{Global Banzhaf impact obtained with \ban{}.}
 \end{subfigure}
   \caption{The global impacts of individual features for the \nhgb{} dataset.}
 \label{fig:nhanes}
 \end{figure}
  \begin{figure}[!ht]
  \centering
 \begin{subfigure}[t]{.49\textwidth}
 \centering
 \includegraphics[width=.8\linewidth]{img/flights_DT_100_shap_orig_c.png}
   \caption{Global Shapley impact obtained with \texttt{TREESHAP\_PATH}.}
 \end{subfigure}
 \begin{subfigure}[t]{.49\textwidth}
 \centering
   \includegraphics[width=.8\linewidth]{img/flights_DT_100_banzhaf_fast.png}
   \caption{Global Banzhaf impact obtained with \ban{}.}
 \end{subfigure}
  \caption{The global impacts of individual features for the  \fldt{} dataset. We observe small differences in the ordering of less important features.}
 \label{fig:flightsdt}
 
 \end{figure}
 
 
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\linewidth}
\centering
\includegraphics[width=.8\linewidth]{img/boston_GBDT_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\linewidth}
\centering
  \includegraphics[width=.8\linewidth]{img/boston_GBDT_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
  \caption{The global impacts of the individual features for the \bosgb{} dataset. We observe that the plots are indistinguishable.}
\label{fig:boston}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/health_insurance_GBDT_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/health_insurance_GBDT_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
\caption{The features' global impacts for the \higb{} dataset. We observe that the plots are indistinguishable.}
\label{fig:health}
\end{figure}

\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/flights_GBDT_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/flights_GBDT_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
\caption{The global impacts of the individual features for the \flgb{} dataset. We observe small differences in the ordering.}
\label{fig:flights}
\end{figure}



\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth,height=6cm]{img/boston_DT_10_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth,height=6cm]{img/boston_DT_10_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
  \caption{The global impacts of the individual features for the \bosdt{} dataset. We observe minor differences between plots.}
\label{fig:bostondt}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/health_insurance_DT_60_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/health_insurance_DT_60_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
\caption{The features' global impacts for the \hidt{} dataset. The plots are almost identical.}
\label{fig:healthdt}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/nhanes_DT_40_shap_orig_c.png}
  \caption{The original Shapley value.}
\end{subfigure}
\begin{subfigure}[t]{.49\textwidth}
\centering
\includegraphics[width=.8\linewidth]{img/nhanes_DT_40_banzhaf_fast.png}
  \caption{The Banzhaf value.}
\end{subfigure}
  \caption{The global impacts of the individual features for the \nhdt{} dataset. The plots are almost identical.}
\label{fig:nhanesdt}
\end{figure}

\clearpage

\section{Per-feature MAD and RMSD Differences}\label{a:errors}


\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.6\textwidth}
\centering
\includegraphics[width=\linewidth]{img/boston_GBDT_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\hspace*{2mm}
\begin{subfigure}[t]{.6\textwidth}
\centering
\includegraphics[width=\linewidth]{img/boston_GBDT_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \bosgb{} dataset.}
\label{fig:bostonl}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/nhanes_GBDT_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/nhanes_GBDT_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \nhgb{} dataset.}
\label{fig:nhanesgb}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/health_insurance_GBDT_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/health_insurance_GBDT_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \higb{} dataset.}
\label{fig:higb}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/flights_GBDT_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/flights_GBDT_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \flgb{} dataset.}
\label{fig:flightsgb}
\end{figure}

\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.6\textwidth}
\centering
\includegraphics[width=\linewidth]{img/boston_DT_10_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\hspace*{2mm}
\begin{subfigure}[t]{.6\textwidth}
\centering
\includegraphics[width=\linewidth]{img/boston_DT_10_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \bosdt{} dataset.}
\end{figure}

\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/nhanes_DT_40_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/nhanes_DT_40_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \nhdt{} dataset.}
\label{fig:nhanesl}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/health_insurance_DT_60_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/health_insurance_DT_60_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \hidt{} dataset.}
\label{fig:hidt}
\end{figure}
\begin{figure}[!ht]
 \centering
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/flights_DT_100_l1.png}
  \caption{MAD difference.}
\end{subfigure}
\begin{subfigure}[t]{.8\textwidth}
\centering
\includegraphics[width=\linewidth]{img/flights_DT_100_l2.png}
  \caption{RMSD difference.}
\end{subfigure}
\caption{The MAD and RMSD differences between the Banzhaf value and the Shapley value for the \fldt{} dataset.}
\label{fig:flightsl}
\end{figure}
\clearpage
\section{Further Related Work}
\label{section:other}
%\todo{Add two AAAI papers}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Feature importance values summarize a complicated ensemble model and
provide insight into what features drive the model's prediction. Explanation methods that compute such values can be classified as model-dependent or model-agnostic, and as global or local.
\paragraph{Explanation methods for trees.} Global feature importance values are
computed for an entire dataset in mainly three different ways. The basic global
approach, \emph{Split Count}, is to count the number of times a feature is used
for splitting \citep{ChenG16}. However, this method fails to account for the impacts
of different splits. The \emph{Gain} approach to feature importance \citep{Breiman}
attributes to each feature the reduction of loss contributed by its splits in each decision tree;
it is widely used as the basis for feature selection methods~\citep{Chebrolu2005,HuynhThu2010InferringRN,Sandri2008ABC}. Another commonly used approach,
\emph{Permutation}, is to randomly permute the data column corresponding to a
feature in the test set and observe the change in the model's loss \citep{Breiman2004}.
If the model is heavily dependent on the feature then permuting it should create a
large increase in the model's loss.
These approaches are designed to estimate the global importance of a feature
over an entire dataset, so they are not directly applicable to local explanations
that are specific to each prediction. Local explanation methods for computing
feature importance values for a single prediction are not well studied for trees.
Only a couple of tree-specific local explanation methods were known previously.
One is to simply report the decision path, which is not useful for large tree ensembles.
The other, due to \citet{Saabas}, is a heuristic method that measures
the difference in the model's expected output. The Saabas method explains a prediction
by following the decision path of the current input and attributing the differences in the
expected output of the model to each of the features along the path. The expected value
of every node in the tree is the average of the model output over the training samples
going through that node. For explaining an ensemble model made of many
trees, the Saabas value for the ensemble is defined as the sum of the values for each tree.
%
As noted by \citet{lundberg2018consistent}, the feature importance values from the gain,
split count, and Saabas methods are all inconsistent, i.e., a model can be modified so that
it relies more on a given feature, yet the importance assigned to that feature decreases.

\paragraph{Model-agnostic methods.} One of the most common local explanation
methods in the deep learning literature is to take the gradient of the model's output with
respect to its inputs at the current sample, or to multiply this gradient by the values
of the input features. As depending entirely on the gradient of the model at a single
point can often be misleading \citep{Shrikumar2016NotJA}, various other methods
have also been proposed~\citep{Springenberg2015StrivingFS,Zeiler2014VisualizingAU,Bach2015OnPE,Shrikumar2016NotJA,Kindermans2018LearningHT,Ancona2018TowardsBU}.
%
Model-agnostic methods, on the other hand, make no assumptions about the internal
structure of the model and depend on the relationship between changes in the model
inputs and model outputs. One approach is to train a global mimic model that approximates
the original model and then to locally explain the mimic model \citep{Baehrens10a,Plumb18}.
Alternatively, the mimic model can be fit to the original model locally for each prediction.
In the LIME method~\citep{Ribeiro2016WhySI}, the coefficients of a local linear mimic model
are used as the explanation. In the method of \citet{ribeiro2018anchors}, the rules of
a local decision-rule mimic model are used as the explanation.
Recently, several methods for the local explanation of model predictions (such as
LIME~\citep{Ribeiro2016WhySI}, DeepLIFT \citep{Shrikumar2016NotJA,shrikumar17a},
Layer-wise Relevance Propagation~\citep{Bach2015OnPE}, and three methods from
cooperative game theory: Shapley regression values \citep{Lipovetsky2001AnalysisOR},
Shapley sampling values~\citep{StrumbeljK14}, and Quantitative Input
Influence~\citep{Datta2016AlgorithmicTV}) have been unified into a single class of
\emph{additive feature attribution methods}~\citep{Lundberg2017}. This class contains
methods that explain a model's output as a sum of real values attributed to each input
feature. It is of particular interest as there is a unique optimal explanation approach in
the class that satisfies three desirable properties: local accuracy, missingness, and
consistency~\citep{Roth,Shapley53}. \emph{Local accuracy} (also called \emph{Efficiency}
or \emph{Completeness}) means that the sum of the feature attributions is equal to the
output of the function we want to explain. \emph{Missingness} (also called \emph{Sensitivity},
or \emph{Null-player axiom}) means that missing features are given no importance, and
\emph{Consistency} (also called \emph{Monotonicity}) means that if a feature has a larger
impact on the model after a change then the attribution assigned to that feature can only increase.
%
One can use model-agnostic local explanation methods to explain tree models; however,
their dependence on post-hoc modeling of an arbitrary function can make them slow, and
they may suffer from sampling variability for models with many input features~\citep{Lundberg2020}.
Although such methods are often practical for individual explanations, they can quickly
become impractical for explaining entire datasets.


\section{Omitted proofs}\label{a:omitted}

\laddfeature*
\begin{proof}
  The proof proceeds by induction on the depth of $v$ in~$\tr$.
  The claim holds obviously for $v=\rho$.
  So suppose $v$ is non-root.

  Assume first that $d_{p_v}\notin Q\cup\{y\}$. Then, by applying the definition of $P[\cdot,\cdot]$
  twice,
  and the induction hypothesis:
  \begin{align*}
    P[v,Q\cup\{y\}]&=P[p_v,Q\cup\{y\}]\cdot \frac{r_v}{r_{p_v}}\\
                   &=P[p_v,Q]\cdot\Delta_{p_v,y}\cdot\frac{r_v}{r_{p_v}}\\
    &=P[v,Q]\cdot \frac{r_{p_v}}{r_v}\cdot \Delta_{p_v,y}\cdot\frac{r_v}{r_{p_v}}\\
    &=P[v,Q]\cdot \Delta_{v,y}.
  \end{align*}
  Otherwise, $d_{p_v}\in Q\cup \{y\}$.
  Assume w.l.o.g.\ that $v=a_{p_v}$; the case $v=b_{p_v}$ is symmetric.
  We have:
  \begin{align*}
    P[v,Q\cup\{y\}]&=P[p_v,Q\cup\{y\}]\cdot [x_{d_{p_v}}< t_{p_v}]\\
                   &=P[p_v,Q]\cdot \Delta_{p_v,y}\cdot [x_{d_{p_v}}< t_{p_v}].
  \end{align*}
  If $d_{p_v}=y$, then $d_{p_v}\notin Q$, so $P[v,Q]=P[p_v,Q]\cdot\frac{r_v}{r_{p_v}}$, and we have
  $\Delta_{v,y}=\Delta_{p_v,y}\cdot [x_y<t_{p_v}]\cdot \frac{r_{p_v}}{r_v}$.
  So in that
  case
  \begin{align*}
    P[v,Q\cup\{y\}]&=P[p_v,Q]\cdot \Delta_{v,y}\cdot\frac{r_v}{r_{p_v}}
    = P[v,Q]\cdot \Delta_{v,y}.
  \end{align*}
  If, on the other hand, we have $y\neq d_{p_v}\in Q$, then:
  \begin{align*}
    P[v,Q\cup\{y\}]&=P[p_v,Q]\cdot \Delta_{p_v,y}\cdot [x_{d_{p_v}}< t_{p_v}]
    =P[v,Q]\cdot \Delta_{v,y}.\qedhere
  \end{align*}
\end{proof}

\ldpban*
\begin{proof}
  Let $m=|G|$. By the definition and Lemma~\ref{l:add_feature}, we have:
  \begin{align*}\beta(v,G\cup\{y\})&=\sum_{S\subseteq G\cup\{y\}}\frac{1}{2^{m+1}}P[v,S]\\
    &=\sum_{S\subseteq G} \frac{1}{2^{m+1}}P[v,S]
    +\sum_{y\in S\subseteq G\cup\{y\}} \frac{1}{2^{m+1}}P[v,S]\\
    &=\sum_{S\subseteq G} \frac{1}{2} \cdot \frac{1}{2^m} P[v,S]
    +\sum_{S\subseteq G} \frac{1}{2}\cdot\Delta_{v,y}\cdot \frac{1}{2^m} P[v,S]\\
&=\frac{1}{2}\left(1+\Delta_{v,y}\right)\beta(v,G). \qedhere
\end{align*}
\end{proof}
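
As a quick sanity check of Lemma~\ref{l:dp-ban} (an illustration only, not needed for the proofs), take $G=\emptyset$, so that $\beta(v,\emptyset)=P[v,\emptyset]$.
Expanding the definition of $\beta(v,\{y\})$ and applying Lemma~\ref{l:add_feature} with $Q=\emptyset$ gives
\begin{align*}
  \beta(v,\{y\})&=\frac{1}{2}\left(P[v,\emptyset]+P[v,\{y\}]\right)\\
  &=\frac{1}{2}\left(1+\Delta_{v,y}\right)P[v,\emptyset]
  =\frac{1}{2}\left(1+\Delta_{v,y}\right)\beta(v,\emptyset),
\end{align*}
in agreement with the lemma.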

\lliftban*
\begin{proof}
  The claim follows easily by the definition of $\beta(v,Q)$ and
  since $P[v,G]=P[p_v,G]\cdot \frac{r_v}{r_{p_v}}$ holds for
  every subset $G\subseteq Q$.
\end{proof}

\lreduce*
\begin{proof}
By Lemmas~\ref{l:add_feature}~and~\ref{l:dp-ban}, we have:
\begin{align*}
  \beta_i&=\sum_{l\in\lvs(\tr)}f(l)\cdot\left(\frac{1}{2^{n-1}}\sum_{S\subseteq U\setminus\{i\}}\left(P[l,S\cup \{i\}]-P[l,S]\right)\right)\\
  &=\sum_{l\in\lvs(\tr)}f(l)\cdot\left(\left(\frac{\Delta_{l,i}}{2^{n-1}}\sum_{S\subseteq U\setminus\{i\}}P[l,S]\right)-\beta(l,U\setminus\{i\})\right)\\
  &=\sum_{l\in\lvs(\tr)}f(l)\cdot\left(\Delta_{l,i}\cdot \beta(l,U\setminus\{i\})-\beta(l,U\setminus\{i\})\right)\\
    &=\sum_{l\in\lvs(\tr)}2f(l)\cdot\left(\beta(l,U)-\beta(l,U\setminus\{i\})\right).
\end{align*}
Note that if $y\notin F_l$, then $\Delta_{l,y}=1$, and thus by Lemma~\ref{l:dp-ban},
we have $\beta(l,X\cup \{y\})=\beta(l,X)$ for any $X\subseteq U\setminus \{y\}$. Inductively we obtain
$\beta(l,X\cup Y)=\beta(l,X)$ for any $Y\subseteq U\setminus X\setminus F_l$.
In particular, we obtain $\beta(l,U)=\beta(l,F_l)$, and $\beta(l,U\setminus \{i\})=\beta(l,F_l\setminus \{i\})$.
To finish the proof, observe that if $i\notin F_l$, then
$\beta(l,F_l)=\beta(l,F_l\setminus\{i\})$ by Lemma~\ref{l:dp-ban}, so
for such $i$ the summand above will be equal to $0$.
\end{proof}

\lbcontribs*
\begin{proof}
  Recall from the proof of Lemma~\ref{l:reduce} that
  \begin{align*}
  \beta_i&=\sum_{\substack{l\in\lvs(\tr)\\i\in F_l}}f(l)\cdot(\Delta_{l,i}-1)\cdot \beta(l,F_l\setminus\{i\}).
\end{align*}
By changing the order of summation, we equivalently have:
\begin{align*}
  \beta_i&=\sum_{\substack{v\in\tr\\d_{p_v}=i}}\sum_{l\in \lvs_v}f(l)\cdot (\Delta_{l,i}-1)\cdot \beta(l,F_l\setminus\{d_{p_v}\})\\
  &=\sum_{\substack{v\in\tr\\d_{p_v}=i}}\sum_{l\in \lvs_v}f(l)\cdot (\Delta_{l,i}-1)\cdot \frac{2}{1+\Delta_{l,i}}\cdot \beta(l,F_l).
\end{align*}
The lemma follows by the definition of $B(v)$ and the fact that $\Delta_{l,i}=\Delta_{v,i}$.
\end{proof}

\section{Improved algorithm for Shapley attributions}\label{a:shapley}
In this section we sketch the changes that need to be made to the algorithms of Section~\ref{s:algo} so that they compute Shapley value-based explanations
as given by~\eqref{eq:shap}.

We use intermediate values $\phi(\cdot,\cdot,\cdot)$ analogous to the values
$\beta(\cdot,\cdot)$ that constituted the base of the Banzhaf algorithm.
For any vertex $v\in \tr$, set $G\subseteq U$ and integer $k=0,\ldots,|G|$, let
\begin{equation}\label{eq:dpdef}
  \phi(v,G,k):=\frac{1}{|G|+1}\sum_{\substack{S\subseteq G\\|S|=k}} \binom{|G|}{k}^{-1}\cdot P[v,S].
\end{equation}
Let us also define $\phi(v,G)$ as the vector consisting of all the values $\phi(v,G,\cdot)$:
\begin{equation*}
  \phi(v,G):=\left(\phi(v,G,k)\right)_{k=0}^{|G|}.
\end{equation*}

We have the following analogues of Lemma~\ref{l:dp-ban}~and~Lemma~\ref{l:lift-ban}, respectively.
For convenience, let us define $\phi(v,G,k)=0$ for $k<0$ or $k>|G|$.

\begin{lemma}\label{l:dp}
Let $v\in \tr$, $G\subseteq U$ and $k\in\{0,\ldots,|G|+1\}$. Let $y\in U\setminus G$. Then:
  \begin{align*}
    \phi(v,G\cup\{y\},k)&=\frac{|G|+1-k}{|G|+2}\cdot\phi(v,G,k)+\frac{k}{|G|+2}\cdot \Delta_{v,y}\cdot \phi(v,G,k-1).
  \end{align*}
\end{lemma}
\begin{proof}
  Let $m=|G|+1$.
  %and $X=\phi(v,G\cup\{y\},k)$.
  By Lemma~\ref{l:add_feature} we get:
  \begin{align*}\phi(v,G\cup\{y\},k)&=\sum_{\substack{S\subseteq G\cup\{y\}\\|S|=k}}\frac{1}{m+1} \binom{m}{k}^{-1} P[v,S]\\
    &=\left(\sum_{\substack{S\subseteq G\\|S|=k}} \frac{1}{m+1}\binom{m}{k}^{-1} P[v,S]\right)
    +\left(\sum_{\substack{y\in S\subseteq G\cup\{y\}\\|S|=k}} \frac{1}{m+1}\binom{m}{k}^{-1} P[v,S]\right)\\
    &=\left(\sum_{\substack{S\subseteq G\\|S|=k}} \frac{m-k}{m+1} \cdot \frac{1}{m}\binom{m-1}{k}^{-1} P[v,S]\right)
    +\left(\sum_{\substack{S\subseteq G\\|S|=k-1}} \frac{k}{m+1}\cdot\Delta_{v,y}\cdot \frac{1}{m}\binom{m-1}{k-1}^{-1} P[v,S]\right)\\
&=\frac{m-k}{m+1}\cdot \phi(v,G,k)+
\frac{k}{m+1}\cdot \Delta_{v,y}\cdot \phi(v,G,k-1).\qedhere
\end{align*}
\end{proof}
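
For illustration, consider the smallest instance $G=\emptyset$ of Lemma~\ref{l:dp}, where~\eqref{eq:dpdef} gives $\phi(v,\emptyset,0)=P[v,\emptyset]$ and, by the convention above, $\phi(v,\emptyset,-1)=\phi(v,\emptyset,1)=0$.
Evaluating~\eqref{eq:dpdef} directly for $G\cup\{y\}=\{y\}$ and using Lemma~\ref{l:add_feature}, we get
\begin{align*}
  \phi(v,\{y\},0)&=\frac{1}{2}\binom{1}{0}^{-1}P[v,\emptyset]=\frac{1}{2}\cdot\phi(v,\emptyset,0),\\
  \phi(v,\{y\},1)&=\frac{1}{2}\binom{1}{1}^{-1}P[v,\{y\}]=\frac{1}{2}\cdot\Delta_{v,y}\cdot\phi(v,\emptyset,0),
\end{align*}
which matches the recurrence, whose coefficients $\frac{|G|+1-k}{|G|+2}$ and $\frac{k}{|G|+2}$ equal $\frac{1-k}{2}$ and $\frac{k}{2}$ here, for both $k=0$ and $k=1$.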
\begin{lemma}\label{l:lift}
  Let $v\in \tr$ be a non-root node and let ${Q\subseteq U\setminus \{d_{p_v}\}}$. Then, for all $k$,
  \begin{equation*}
    \phi(v,Q,k)=\phi(p_v,Q,k)\cdot\frac{r_v}{r_{p_v}}.
  \end{equation*}
\end{lemma}
\begin{proof}
  The claim follows easily by the definition of $\phi(v,Q,k)$ and
  since $P[v,G]=P[p_v,G]\cdot \frac{r_v}{r_{p_v}}$ holds for
  every subset $G\subseteq Q$.
\end{proof}

Let $\Phi(v,G)$ be the sum of individual coordinates of the vector
$\phi(v,G)$, i.e., $\Phi(v,G):=\sum_{k=0}^{|G|}\phi(v,G,k)$.

The following lemma states an intuitive fact that $\Phi(v,G)$ does
not depend on the features in $G$ that do not appear
in the ancestors of $v$.

\begin{lemma}\label{l:sum}
Let $v\in \tr$ and $G\subseteq U$. Suppose $y\in U\setminus G\setminus F_v$. Then:
  \begin{equation*}
  \Phi(v,G\cup \{y\})=\Phi(v,G).
\end{equation*}
\end{lemma}
\begin{proof}
Recall that $y\in U\setminus G\setminus F_v$ and thus
  $\Delta_{v,y}=1$. By Lemma~\ref{l:dp}, we have:
\begin{align*}
\Phi(v,G\cup \{y\}) &= \sum_{k=0}^{|G|+1}\phi(v,G\cup\{y\},k)\\
                  &=\sum_{k=0}^{|G|+1}\frac{|G|+1-k}{|G|+2}\cdot \phi(v,G,k)
                  +\sum_{k=0}^{|G|+1}\frac{k}{|G|+2}\cdot \phi(v,G,k-1)\\
                  &=\sum_{k=0}^{|G|}\frac{|G|+1-k}{|G|+2}\cdot \phi(v,G,k)
                  +\sum_{k=0}^{|G|}\frac{k+1}{|G|+2}\cdot \phi(v,G,k)\\
                  &=\sum_{k=0}^{|G|}\phi(v,G,k)\\
                  &=\Phi(v,G).\qedhere
\end{align*}
\end{proof}
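
To illustrate Lemma~\ref{l:sum} on a small case, let $G=\{a\}$ and let $y\in U\setminus G\setminus F_v$, so that $\Delta_{v,y}=1$ and hence, by Lemma~\ref{l:add_feature}, $P[v,S\cup\{y\}]=P[v,S]$ for every $S\subseteq G$.
By~\eqref{eq:dpdef}, we have $\Phi(v,\{a\})=\frac{1}{2}P[v,\emptyset]+\frac{1}{2}P[v,\{a\}]$, whereas
\begin{align*}
  \Phi(v,\{a,y\})&=\frac{1}{3}P[v,\emptyset]+\frac{1}{3}\cdot\frac{1}{2}\left(P[v,\{a\}]+P[v,\{y\}]\right)+\frac{1}{3}P[v,\{a,y\}]\\
  &=\left(\frac{1}{3}+\frac{1}{6}\right)P[v,\emptyset]+\left(\frac{1}{6}+\frac{1}{3}\right)P[v,\{a\}]\\
  &=\Phi(v,\{a\}).
\end{align*}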

The following is an analogue of Lemma~\ref{l:reduce} for Shapley value that reduces computing
the Shapley explanation $(\phi_i)_{i\in U}$ to computing
the vectors of the form $\phi(l,F_l\setminus \{i\})$
for all pairs $(l,i)\in \lvs(\tr)\times U$ with $i\in F_l$.



\begin{lemma}\label{l:reduce-shap}
For any $i\in U$, we have:
\begin{equation*}
\phi_i=\sum_{\substack{l\in\lvs(\tr)\\i\in F_l}}f(l)\cdot \left(\Delta_{l,i}-1\right)\cdot \Phi(l,F_l\setminus\{i\}).
\end{equation*}
\end{lemma}
\begin{proof}
By expanding the sum~\eqref{eq:shap} using~\eqref{eq:tpd}, we obtain:
\begin{align*}
  \phi_i&=\frac{1}{n}\sum_{S\subseteq U\setminus\{i\}} {\binom{n-1}{|S|}}^{-1}\left(g(S\cup\{i\})-g(S)\right)\\
  &=\frac{1}{n}\sum_{S\subseteq U\setminus\{i\}} {\binom{n-1}{|S|}}^{-1}\left(\sum_{l\in\lvs(\tr)}f(l)\left(P[l,S\cup\{i\}]-P[l,S]\right)\right).
\end{align*}
By subsequently applying Lemma~\ref{l:add_feature}, and changing the summation order, we have:
\begin{align*}
  \phi_i&=\frac{1}{n}\sum_{S\subseteq U\setminus\{i\}} {\binom{n-1}{|S|}}^{-1}\left(\sum_{l\in\lvs(\tr)}f(l)\cdot P[l,S]\left(\Delta_{l,i}-1\right)\right)\\
  &=\sum_{l\in\lvs(\tr)}f(l)\cdot \left(\Delta_{l,i}-1\right)\left(\frac{1}{n}\sum_{k=0}^{n-1}\sum_{\substack{S\subseteq U\setminus\{i\}\\|S|=k}} {\binom{n-1}{k}}^{-1}P[l,S]\right)\\
  &=\sum_{l\in\lvs(\tr)}f(l)\cdot \left(\Delta_{l,i}-1\right)\cdot \Phi(l,U\setminus\{i\}).
\end{align*}
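Since $\left(\Delta_{l,i}-1\right)=0$ when $i\notin F_l$, and since
$\Phi(l,U\setminus\{i\})=\Phi(l,F_l\setminus\{i\})$ by repeated application of Lemma~\ref{l:sum},
we actually have: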
\begin{equation*}
  \phi_i=\sum_{\substack{l\in\lvs(\tr)\\i\in F_l}}f(l)\cdot \left(\Delta_{l,i}-1\right)\cdot \Phi(l,F_l\setminus\{i\}).\qedhere
\end{equation*}
\end{proof}
The recursive formulas of Lemmas~\ref{l:dp}~and~\ref{l:lift} allow computing each $\phi(v,F_v)$ out of a ``neighboring'' vector
$\phi(p_v,F_{p_v})$ in $O(|F_v|)=O(D)$ time.
This overhead arises from the fact that the used vectors $\phi(\cdot,\cdot)$ have
up to $D$ coordinates. Recall that when computing Banzhaf value explanations,
similar values had only a single coordinate and hence a similar
transition could be performed in constant time.
Consequently, the basic algorithm of Section~\ref{s:algo-basic} adjusted
to compute the vectors $\phi(v,F_v)$ takes $O(LD^2)$ time,
which matches the bound achieved by~\citet{Lundberg2020}.

To obtain an asymptotically faster $O(LD)$ time algorithm for computing Shapley explanations using the approach of Section~\ref{s:algo-opt},
we need to devise a Shapley-analogue of Lemma~\ref{l:b-contribs}.
To this end, consider the following values:
\begin{equation*}
    \Psi(v,k)=\sum_{l\in \lvs_v}f(l)\cdot \phi(l,F_l,k),
\end{equation*}
that are analogues of the values $B(v)$ from Section~\ref{s:algo-opt}.
By proceeding similarly as in Section~\ref{s:algo-opt}, a bottom-up computation
can be used to compute all the values $\Psi(v,k)$ for $v\in \tr$ in $O(LD)$ time.

Let us also set:
\begin{align*}
  \gamma(v,k)&:=\sum_{l\in \lvs_v}f(l)\cdot \phi(l,F_l\setminus \{d_{p_v}\},k)\\
  \Gamma(v)&:=\sum_{l\in \lvs_v}f(l)\cdot \Phi(l,F_l\setminus \{d_{p_v}\}).
\end{align*}
With these definitions, we can rewrite~Lemma~\ref{l:reduce-shap} as follows:
\begin{align*}
  \phi_i&=\sum_{\substack{l\in\lvs(\tr)\\i\in F_l}}f(l)\cdot \left(\Delta_{l,i}-1\right)\cdot \Phi(l,F_l\setminus\{i\})\\
  &=\sum_{\substack{v\in\tr\\d_{p_v}=i}}\sum_{l\in\lvs_v}f(l)\cdot \left(\Delta_{l,i}-1\right)\cdot \Phi(l,F_l\setminus\{i\})\\
  &=\sum_{\substack{v\in\tr\\d_{p_v}=i}} \left(\Delta_{v,i}-1\right)\sum_{l\in\lvs_v}f(l)\cdot \Phi(l,F_l\setminus\{i\})\\
  &=\sum_{\substack{v\in\tr\\d_{p_v}=i}} \left(\Delta_{v,i}-1\right)\cdot \Gamma(v).
\end{align*}
Note that the above derivation provides an $O(L)$-time
reduction of computing all $\phi_i$ to computing
all values $\Gamma(v)$.
Since $\Gamma(v)=\sum_{k}\gamma(v,k)$, these can be obtained by simple summation in $O(LD)$ time once we have all
the values $\gamma(v,k)$.

The following lemma, analogous to Lemma~\ref{l:dp}, gives a relationship between
values $\Psi(\cdot,\cdot)$ and $\gamma(\cdot,\cdot)$.

\begin{lemma}\label{l:dp2}
Let $v\in\tr$, $v\neq \rho$.
Suppose the sets $F_l$ have equal sizes $s$ for all $l\in \lvs_v$.
Then, for any $k=0,\ldots,s$, we have:
\begin{align*}
  &\Psi(v,k)=\frac{s-k}{s+1}\cdot \gamma(v,k) + \frac{k}{s+1}\cdot \Delta_{v,d_{p_v}}\cdot \gamma(v,k-1).
\end{align*}
\end{lemma}
\begin{proof}
By Lemma~\ref{l:dp}, for any $l\in\lvs_v$ we have:
  \begin{align*}
    \phi(l,F_l,k)&=\frac{|F_l|-k}{|F_l|+1}\phi(l,F_l\setminus\{d_{p_v}\},k)+\frac{k}{|F_l|+1}\cdot \Delta_{l,d_{p_v}}\cdot \phi(l,F_l\setminus\{d_{p_v}\},k-1)\\
    &=\frac{s-k}{s+1}\phi(l,F_l\setminus\{d_{p_v}\},k)+\frac{k}{s+1}\cdot \Delta_{l,d_{p_v}}\cdot \phi(l,F_l\setminus\{d_{p_v}\},k-1).
  \end{align*}
We obtain the desired equality by summing the above
through all $l\in\lvs_v$ and using $\Delta_{l,d_{p_v}}=\Delta_{v,d_{p_v}}$.
\end{proof}
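
One way to read Lemma~\ref{l:dp2} (a sketch under the lemma's equal-size assumption, not necessarily the exact procedure used in an implementation) is as a triangular system for the unknowns $\gamma(v,0),\ldots,\gamma(v,s-1)$; note that $\gamma(v,k)=0$ for $k\geq s$, since each set $F_l\setminus\{d_{p_v}\}$ has size $s-1$.
Solving for $\gamma(v,k)$ in increasing order of $k$ gives
\begin{align*}
  \gamma(v,0)&=\frac{s+1}{s}\cdot\Psi(v,0),\\
  \gamma(v,k)&=\frac{s+1}{s-k}\cdot\left(\Psi(v,k)-\frac{k}{s+1}\cdot\Delta_{v,d_{p_v}}\cdot\gamma(v,k-1)\right)\quad\text{for }k=1,\ldots,s-1,
\end{align*}
so all of them can be recovered from the values $\Psi(v,\cdot)$ using $O(s)$ arithmetic operations.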

Lemma~\ref{l:dp2} would suffice to compute all the needed values $\gamma(v,k)$
if only all the sets $F_l$, $l\in \lvs_v$ had equal sizes for each vertex $v\in\tr$.
Unfortunately, this is not true in general.
To deal with this problem, we need to make a subtle change
to the algorithm.
Ideally, we would like all the sets $F_l$ for $l\in\lvs(\tr)$ to have the same size $D$,
where $D$ is the maximum size of $F_l$ in the input tree.
This
could be ensured, for example, by extending all smaller $F_l$ with $D-|F_l|$
distinct dummy features that do not appear in $F_l$ -- recall from Lemma~\ref{l:sum}
that adding dummy features does not change $\Phi(v,G)$,
for any $G\subseteq U$, so it does not influence our results.
Unfortunately, adding a dummy feature to $F_l$
by simply using Lemma~\ref{l:dp} costs $\Theta(D)$ time.
Therefore, if $\tr$ were very
unbalanced, padding all $F_l$ could cost as much as $\Theta(LD^2)$ time.
We thus need a smarter approach.

Instead, let $q_1,\ldots,q_D$ be distinct artificial features \emph{not} appearing
in the nodes of $\tr$. For \emph{all} $v\in \tr$ let us define
\begin{equation*}
  F^*_v=F_v\cup \{q_1,\ldots, q_{D-|F_v|}\}.
\end{equation*}
Observe that then $F^*_\rho=\{q_1,\ldots,q_D\}$ for the root $\rho$ of $\tr$, and for each
non-root $v$ we have
\begin{equation}\label{eq:fstar}
  F^*_v=\begin{cases}F^*_{p_v}&\text{ if }d_{p_v}\in F_{p_v}\\
    F^*_{p_v}\setminus\{q_{D-|F_{p_v}|}\}\cup \{d_{p_v}\}&\text{ otherwise.}
\end{cases}
\end{equation}
With sets $F^*_v$ defined like this, $v\in\tr$, by Lemma~\ref{l:sum}, we have:
\begin{equation*}
  \Phi(v,F_v\setminus\{d_{p_v}\})=\Phi(v,F_v^*\setminus\{d_{p_v}\}),
\end{equation*}
and consequently:
\begin{equation*}
  \Gamma(v)=\sum_{l\in \lvs_v}f(l)\cdot \Phi(l,F_l^*\setminus\{d_{p_v}\}).
\end{equation*}
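
As a small illustration of~\eqref{eq:fstar} (with hypothetical feature names $a$ and $b$), suppose $D=3$, so that $F^*_\rho=\{q_1,q_2,q_3\}$.
If a child $v$ of the root satisfies $d_{p_v}=a$, then $F_v=\{a\}$ and $F^*_v=\{a,q_1,q_2\}$.
For a child $w$ of $v$, either $d_{p_w}=a\in F_v$ and $F^*_w=F^*_v$, or $d_{p_w}=b\notin F_v$ and
\begin{equation*}
  F^*_w=F^*_v\setminus\{q_2\}\cup\{b\}=\{a,b,q_1\}.
\end{equation*}
In every case the padded set has size exactly $D$, and moving from a node to its child changes at most one element of the set.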

It is thus enough to modify the basic algorithm computing all the vectors $\phi(v,F_v)$
so that it computes all the vectors $\phi(v,F_v^*)$ instead.
This is easy to achieve.
First of all, the initial vector $\phi(\rho,F^*_\rho)$
is initialized in $O(D^2)=O(LD)$ time by applying
Lemma~\ref{l:dp} $D$ times.
By~\eqref{eq:fstar}, for each non-root $v$, the vector $\phi(v,F_v^*)$ can be still obtained
from the vector $\phi(p_v,F_{p_v}^*)$ in $O(D)$ time as before
using $O(1)$ applications of Lemmas~\ref{l:dp}~and~\ref{l:lift}.



\section{Proof of Lemma~\ref{l:numeric-bound}}\label{s:numeric-bound}

Let us first argue that moving between nearby values $\beta(v,G)$
indeed boils down to $O(1)$ multiplications/divisions of some value $\beta(v,G)$ by
a number between $0.5$ and $r_\rho$.
If Lemma~\ref{l:dp-ban} is used, then $\beta(v,G)$ is multiplied
by a number that is at least $0.5$ (if $x_f\notin I_{v,f}$), and at
most $(1+1/c_v(f))/2\leq (1+1/(1/r_\rho))/2= (1+r_\rho)/2\leq r_\rho$,
since the coverages $r_v$ are positive integers.
On the other hand, Lemma~\ref{l:lift-ban} requires
a single multiplication by a number of the form $r_v/r_{p_v}$, which
translates to two multiplications/divisions by an integer between $1$ and $r_\rho$.

Let $\epsilon<0.1$ be the machine epsilon.
Using a well-established model of floating point numbers (see, e.g.,~\cite{higham}),
we can assume that the floating point number representation $\text{fl}(x\circ y)$
of the result of an arithmetic operation $\circ$ on two \emph{exactly represented}
numbers $x,y$ satisfies $\text{fl}(x\circ y)=(x\circ y)(1+\delta)$ for some $|\delta|\leq \epsilon$.
In particular, if $x,y>0$ and $\circ\in \{+,\cdot,/\}$, then we have
\begin{align*}
  \text{fl}(x\circ y)&\leq (x\circ y)(1+\epsilon),\\
  \text{fl}(x\circ y)&\geq (x\circ y)(1-\epsilon)\geq (x\circ y)\cdot \frac{1}{(1+\epsilon)^2}.
\end{align*}
We can thus conclude that if $x'>0$ is a floating-point approximation of a value $x>0$ with multiplicative
error between $(1+\epsilon)^{-k}$ and $(1+\epsilon)^k$, and $y'>0$ is a floating-point approximation of a  value $y>0$
with multiplicative error between $(1+\epsilon)^{-l}$ and $(1+\epsilon)^l$, then for any $\circ\in \{+,\cdot,/\}$
we have:
\begin{align*}
  \text{fl}(x'\circ y')&\leq (x'\circ y')(1+\epsilon)\leq (x\circ y)(1+\epsilon)^{k+l}\cdot (1+\epsilon)=(x\circ y)(1+\epsilon)^{k+l+1},\\
  \text{fl}(x'\circ y')&\geq (x'\circ y')(1-\epsilon)\geq (x\circ y)\frac{1}{(1+\epsilon)^{k+l}}\cdot \frac{1}{(1+\epsilon)^2}=(x\circ y)\frac{1}{(1+\epsilon)^{k+l+2}}.
\end{align*}
More generally, evaluating an arithmetic expression on positive numbers
that involves only additions, multiplications, or divisions, built of $k$ such operations,
has multiplicative error at most $(1+\epsilon)^{O(k)}$ and at least $(1+\epsilon)^{-O(k)}$.
Note that this implies a relative error of $(1+\epsilon)^{O(k)}-1$.
Indeed, if the expression evaluates to $z'>0$ and its true value is $z>0$, then
if $z'\geq z$, we get
\begin{equation*}
  \frac{|z'-z|}{z}=\frac{z'-z}{z}\leq \frac{(1+\epsilon)^{O(k)}z-z}{z}= (1+\epsilon)^{O(k)}-1,
\end{equation*}
whereas if $z'\leq z$, then we get (by applying, in the final step, the inequality $x+1/x\geq 2$ valid for all $x>0$):
\begin{equation*}
  \frac{|z'-z|}{z}=\frac{z-z'}{z}\leq \frac{z-(1+\epsilon)^{-O(k)}z}{z}=1-\frac{1}{(1+\epsilon)^{O(k)}}\leq (1+\epsilon)^{O(k)}-1.
\end{equation*}
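
To give a sense of scale (an illustration under assumptions not made in the text: IEEE double precision with $\epsilon=2^{-53}\approx 1.1\cdot 10^{-16}$, and the constant hidden in $O(k)$ taken to be $1$), note that $(1+\epsilon)^{k}\leq e^{k\epsilon}$ and $e^{t}-1\leq 2t$ for $t\in[0,1]$, so that
\begin{equation*}
  (1+\epsilon)^{k}-1\leq e^{k\epsilon}-1\leq 2k\epsilon\quad\text{whenever }k\epsilon\leq 1.
\end{equation*}
For instance, $k=100$ yields a relative error of at most about $2.2\cdot 10^{-14}$.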

Finally, note that each value $\beta(v,G)$ for $v$ at depth $d$, can be expressed, by an inductive application of Lemmas~\ref{l:dp-ban}~and~\ref{l:lift-ban}, using
a formula with $O(d+|G|)$ multiplications and divisions and input values in the range $[0.5,r_\rho]$,
all of which can be represented as floating point numbers with multiplicative error between $(1+\epsilon)^{-1}$ and $(1+\epsilon)$.
As a result, since $d\leq D$ and $|F_l|\leq D$ for any leaf $l$, the value
$\beta(l,F_l)$ is computed using $O(D)$ applications of Lemmas~\ref{l:dp-ban}~and~\ref{l:lift-ban}, and thus
with relative error $(1+\epsilon)^{O(D)}-1$.

\fi
\end{document}