\documentclass[accepted]{uai2022} 

%% Choose your variant of English; be consistent
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib}
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{booktabs}
\usepackage{tikz}
\usepackage{multirow}
\usepackage[ruled]{algorithm2e}
\usepackage{xr}
\externaldocument{perez_396} 

% Add-on Math stuff
\newtheorem*{remark}{Remark}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\title{Attribution of Predictive Uncertainties in Classification Models \\ (Supplementary material)}

% Add authors
\author{Iker Perez}
\author{Piotr Skalski}
\author{Alec Barns-Graham}
\author{Jason Wong}
\author{David Sutton}
\affil{%
    Featurespace Research\\
    Cambridge\\
    United Kingdom
}

\begin{document}
\maketitle


\section{Bayesian Presentation} \label{app:Bayesian}

In Bayesian settings, model uncertainties are often decomposed into \textit{aleatoric} and \textit{epistemic} components that help scrutinise different aspects of the functioning of a model, and can facilitate interpretability or fairness assessments in important machine learning applications \citep{awasthi2021evaluating}. Hence we may wish to offer attributions that are representative of isolated types of uncertainties.

On training a neural classifier $f:\mathbb{R}^n \times \mathcal{W} \rightarrow \Delta^{|\mathcal{C}|-1}$ within an (approximate) Bayesian setting, we commonly obtain a \textit{posterior} over the hypothesis space of models, i.e. a distribution $\pi(\boldsymbol{w}|\mathcal{D})$ over model weights conditioned on the available training data $\mathcal{D}=\{\boldsymbol{x}_i, c_i\}_{i=1, 2,\dots}$. Popular approaches to obtaining such a posterior differ in how they incorporate \textit{prior} knowledge, and include \textit{dropout} \citep{srivastava2014dropout}, \textit{Bayes-by-Backprop} \citep{blundell2015weight} or SG-HMC \citep{springenberg2016bayesian}. Here, a model score for a new data point $\boldsymbol{x}^\star\in\mathbb{R}^n$ is derived from the \textit{posterior predictive distribution} by marginalising over posterior weights, i.e.
\begin{equation*}
\pi(\boldsymbol{x}^\star|\mathcal{D})=\int_{\mathcal{W}} f(\boldsymbol{x}^\star, \boldsymbol{w}) \pi(\boldsymbol{w}|\mathcal{D}) d\boldsymbol{w} = \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f(\boldsymbol{x}^\star, \boldsymbol{w})],
\end{equation*}
and is easily approximated as $\frac{1}{N} \sum_{i=1}^N f(\boldsymbol{x}^\star, \boldsymbol{w}_i)$, with weight samples $\boldsymbol{w}_i\sim\pi(\boldsymbol{w}|\mathcal{D})$, $i=1,\dots,N$. This setting is analogous to the presentation in Section \ref{sec:attributions}; however, the point estimate score must now be averaged over the posterior, i.e. $f(\boldsymbol{x}^\star) = \pi(\boldsymbol{x}^\star|\mathcal{D}) = \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f(\boldsymbol{x}^\star, \boldsymbol{w})]$.

The \textit{entropy} is thus given by
\begin{equation*}
H(\boldsymbol{x}|\mathcal{D})=-\sum_{c\in\mathcal{C}} \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f_c(\boldsymbol{x}, \boldsymbol{w})] \cdot \log \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f_c(\boldsymbol{x}, \boldsymbol{w})],
\end{equation*}
and may be decomposed through the law of iterated variances \citep{kendall2017uncertainties} so as to yield an \textit{aleatoric} term 
\begin{align*}
H_a(\boldsymbol{x}|\mathcal{D}) & = \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[H(\boldsymbol{x}, \boldsymbol{w})] \\
& = -\sum_{c\in\mathcal{C}} \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}\big[ f_c(\boldsymbol{x}, \boldsymbol{w}) \cdot \log f_c(\boldsymbol{x}, \boldsymbol{w})\big], 
\end{align*}
which measures the mean predictive entropy across models in the posterior hypothesis space, as well as the \textit{mutual information} or \textit{epistemic} term, $H_e(\boldsymbol{x}|\mathcal{D}) = H(\boldsymbol{x}|\mathcal{D}) - H_a(\boldsymbol{x}|\mathcal{D})$, which represents model uncertainty projected into the latent membership vector $\pi(\boldsymbol{x}|\mathcal{D})$. Intuitively, aleatoric uncertainty represents natural stochastic variation in the observations over repeated experiments; on the other hand, epistemic uncertainty is descriptive of model unknowns due to inadequate data or inappropriate modelling choices.
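
As a minimal sketch of how these terms may be estimated in practice (the estimators $\hat{H}$, $\hat{H}_a$, $\hat{H}_e$ and the shorthand $\bar{f}_c$ are notation introduced here for illustration), let $\bar{f}_c = \frac{1}{N}\sum_{j=1}^N f_c(\boldsymbol{x}, \boldsymbol{w}_j)$ with samples $\boldsymbol{w}_j\sim\pi(\boldsymbol{w}|\mathcal{D})$. Then
\begin{align*}
\hat{H} & = -\sum_{c\in\mathcal{C}} \bar{f}_c \log \bar{f}_c, \\
\hat{H}_a & = -\frac{1}{N}\sum_{j=1}^N \sum_{c\in\mathcal{C}} f_c(\boldsymbol{x}, \boldsymbol{w}_j) \log f_c(\boldsymbol{x}, \boldsymbol{w}_j), \\
\hat{H}_e & = \hat{H} - \hat{H}_a.
\end{align*}
For example, with two classes and $N=2$ samples scoring $(0.99, 0.01)$ and $(0.01, 0.99)$, we have $\bar{f} = (0.5, 0.5)$, so that $\hat{H} = \log 2 \approx 0.693$, $\hat{H}_a \approx 0.056$ and $\hat{H}_e \approx 0.637$ (in nats). Each sampled model is confident, yet the models disagree, and most of the entropy is thus epistemic. Since the entropy is concave, Jensen's inequality ensures that $\hat{H}_e \geq 0$.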

\textbf{Path integrals}. The \textit{posterior predictive classifier} $\pi(\boldsymbol{x}|\mathcal{D})$ accepts a path importance for an arbitrary scalar output $F(\boldsymbol{x}, \boldsymbol{w})$ at index $i$, given by
\begin{equation*}
\text{attr}^\delta_i(\boldsymbol{x}) = \int_0^1 \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}\bigg[
\frac{\partial F(\delta(\alpha), \boldsymbol{w})}{\partial \delta_i(\alpha)} \bigg]
\frac{\partial \delta_i(\alpha)}{\partial \alpha} d\alpha.
\end{equation*}
This represents a \textit{mean-average} trajectory over a curve $\delta$ and follows from \textit{dominated convergence}. It readily extends to the attribution of uncertainties, i.e.
\begin{equation*}
\text{attr}^\delta_i(\boldsymbol{x}) = - \sum_{c\in\mathcal{C}} \int_0^1 \Delta_i(\alpha) \frac{\partial \delta_i(\alpha)}{\partial \alpha} d\alpha
\end{equation*}
where
\begin{align*}
\Delta&_i(\alpha) = \\
&\big( 1 + \log \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f_c(\delta(\alpha), \boldsymbol{w})] \big) \cdot \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}\Big[\frac{\partial f_c(\delta(\alpha), \boldsymbol{w})}{\partial \delta_i(\alpha)} \Big].
\end{align*}
If we wish to only attribute aleatoric uncertainties, we may replace the above with
\begin{equation*}
\Delta_i(\alpha) = \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}\Big[ \big( 1 + \log f_c(\delta(\alpha), \boldsymbol{w}) \big) \cdot \frac{\partial f_c(\delta(\alpha), \boldsymbol{w})}{\partial \delta_i(\alpha)} \Big].
\end{equation*}
Finally, attributions for any variation in epistemic uncertainty are readily shown to be \textit{explained} as the difference in attributions between full and aleatoric uncertainties. We showed an example of this in Figure \ref{dog_un_expl} within Section \ref{sec:attributions}.
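
As a brief sketch of the underlying reasoning (the shorthand $p_c$ and the superscripts in $\Delta^a_i$ and $\Delta^e_i$ are introduced here for illustration only), let $p_c = \mathbb{E}_{\boldsymbol{w}|\mathcal{D}}[f_c(\delta(\alpha), \boldsymbol{w})]$. The chain rule gives
\begin{equation*}
\frac{\partial (p_c \log p_c)}{\partial \delta_i(\alpha)} = \big( 1 + \log p_c \big) \cdot \frac{\partial p_c}{\partial \delta_i(\alpha)},
\end{equation*}
which recovers the integrand for the full entropy once differentiation and expectation are exchanged. Applying the same rule to $f_c \log f_c$ inside the expectation yields the aleatoric integrand, denoted here by $\Delta^a_i(\alpha)$. By linearity of the integral, epistemic attributions then correspond to the integrand
\begin{equation*}
\Delta^e_i(\alpha) = \Delta_i(\alpha) - \Delta^a_i(\alpha),
\end{equation*}
with $\Delta_i(\alpha)$ the integrand for the full entropy. Assuming that the path and the classifier are sufficiently smooth, the gradient theorem further implies that attributions sum to the change in the attributed quantity along the path, e.g. $\sum_i \text{attr}^\delta_i(\boldsymbol{x}) = H(\delta(1)|\mathcal{D}) - H(\delta(0)|\mathcal{D})$ for the full entropy.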

\section{Robustness to Changes in the Autoencoder} \label{app:robustness}

\begin{table}[b!]
    \centering
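    \caption{Performance metrics for the generative attribution method under architecture variations in the autoencoder. Numbered settings give the dimension of the latent space; \textit{- Aug} refers to the altered data augmentation mechanism, at latent dimension $32$.} \label{tab:robustness}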
    \setlength{\tabcolsep}{6pt}
    \renewcommand{\arraystretch}{1.3}
    \resizebox{0.48\textwidth}{!}{
\begin{tabular}{l|lc|llcc|}
\multicolumn{1}{c|}{\multirow{3}{*}{Setting}} & \multicolumn{2}{c|}{Area over EIC}                                    & \multicolumn{4}{c|}{Uncertainty Reduction Curve}                                 \\ \cline{2-7} 
\multicolumn{1}{c|}{}                         & \multicolumn{1}{c}{\multirow{2}{*}{MNIST}} & \multirow{2}{*}{Fashion} & \multicolumn{2}{c}{MNIST}                         & \multicolumn{2}{c|}{Fashion} \\
\multicolumn{1}{c|}{}                         & \multicolumn{1}{c}{}                       &                          & \multicolumn{1}{c}{1\%} & \multicolumn{1}{c}{5\%} & 1\%           & 5\%          \\ \hline
4                                             & 0.999                                      & 0.918                    & 0.474                   & 0.738                   & 0.165         & 0.350        \\
8                                             & 0.999                                      & 0.916                    & 0.661                   & 0.845                   & 0.192         & 0.374        \\
16                                            & 0.999                                      & 0.919                    & 0.704                   & 0.846                   & 0.196         & 0.393        \\
32                                            & 0.999                                      & 0.925                    & 0.743                   & 0.868                   & 0.204         & 0.395        \\
- Aug                                         & \multicolumn{1}{c}{0.999}                  & 0.930                    & 0.687                   & 0.876                   & 0.184         & 0.403        \\
64                                            & 0.999                                      & 0.922                    & 0.752                   & 0.877                   & 0.203         & 0.392        \\
128                                           & 0.999                                      & 0.925                    & 0.756                   & 0.879                   & 0.206         & 0.400        \\
259                                           & 0.999                                      & 0.927                    & 0.762                   & 0.884                   & 0.204         & 0.405       
\end{tabular}}
\end{table}

In Table \ref{tab:robustness} we show evaluations of performance metrics for the attribution method proposed in this paper, over resampled MNIST and Fashion validation images. We train multiple variational autoencoders and use them as the generative process to define integration paths in our method. These differ in the dimensionality of the latent space used to encode reduced representations of images. This is the most impactful layer for the functioning of the attribution method we have presented, since straight integration lines are defined in this space and later projected into pixel space. Too small or too large a space could lead to out-of-distribution images and integration paths. Additionally, we experiment with altering the data augmentation mechanism used for modifying images prior to training the autoencoder (results are reported at a latent space dimension of $32$). No significant changes in performance were noticed as training regimes and learning rates were modified.
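
Schematically, and with notation introduced here purely for illustration (an encoder mean map $\phi$, a decoder $\psi$ and latent codes $\boldsymbol{z}^0$ and $\boldsymbol{z}$ for the fiducial and the image of interest), such a path may be written as
\begin{equation*}
\delta(\alpha) = \psi\big( \boldsymbol{z}^0 + \alpha (\boldsymbol{z} - \boldsymbol{z}^0) \big), \quad \alpha\in[0, 1],
\end{equation*}
with $\boldsymbol{z} = \phi(\boldsymbol{x})$. This sketch ignores any reconstruction error between $\boldsymbol{x}$ and $\psi(\boldsymbol{z})$, and the main text remains the reference for the exact construction. The settings varied in Table \ref{tab:robustness} correspond to the dimension of $\boldsymbol{z}$.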

In the table, we notice consistent performance, which plateaus beyond a certain latent dimension that is equivalent in the two data sets. This consistency is a consequence of the regularisation term in latent space in \eqref{find_fiducial}, which keeps fiducial points and integration paths strictly in distribution, even if large latent spaces overparametrise the encoding space.

\section{Examples} \label{app:Examples}

In Figure \ref{fig:bayAttr} we find examples of attributions of aleatoric and epistemic uncertainty types, applied to \textit{dogs versus cats} images. Attributions are produced by vanilla integrated gradients as described in Section \ref{sec:attributions}. Saliency masks are combined with a Gaussian kernel in order to draw attention to regions in images associated with different uncertainty types. Similarly, Figure \ref{fig:bayMnist} shows attributions of uncertainty types across selected MNIST images, produced by the generative method presented in this paper.

\begin{figure}[h!]
\centering
\includegraphics[width=0.48\textwidth]{images/extra_dogs-vs-cats.png}
\caption{Aleatoric and epistemic contributions to uncertainty for a classification task in \textit{dogs versus cats} data.} \label{fig:bayAttr}

\end{figure}
\begin{figure}[t!]
\centering
\includegraphics[width=0.48\textwidth]{images/MNIST_importances.png}
\caption{Aleatoric and epistemic contributions to uncertainty for a classification task with MNIST digits.} \label{fig:bayMnist}
\end{figure}

\subsection{Qualitative evaluations}

\begin{figure*}[t!]
\centering
\includegraphics[width=0.98\textwidth]{images/extra_qualitative.png}
\caption{Uncertainty attribution masks across multiple classification tasks and data sets. We display best performing attribution methods with a counterfactual mechanism.} \label{fig:extraQual}
\end{figure*}

Finally, in Figure \ref{fig:extraQual} we show uncertainty attribution masks across a range of classification tasks, on all the data sets explored in this paper. In all cases, we note that attributions relying on counterfactual mechanisms are humanly interpretable. Further integration of counterfactual methods with path integrals ensures that attributions are isolated to a few pixels. In application to human gestures, these are always restricted to facial features around the mouth, cheeks or eyebrows, depending on the classification task. On the contrary, vanilla attributions through integrated gradients (averaged over \textit{black and white} fiducial baselines) are noticeably noisy. Also, segmentation-based mechanisms do not perform well in the data sets we have explored, which do not contain multiple objects that can be easily segregated.

\section{Implementation details} \label{app:Impl}

All of our predictive models are implemented through Keras. The following is a summary of architectures, hyper-parameters, training regimes and further details.

\subsection{MNIST handwritten digits}

Our classifier is a convolutional neural network with \textit{max-pooling} layers and dropout, structured as:
\begin{itemize}
\item Two convolutional layers of kernel size $3\times 3$ and \textit{relu} activation; \textit{filter counts} are $32$ and $64$ for the first and second layers. We use \textit{stride} length of $1$ followed by \textit{max-pooling} layers of \textit{pool size} $2\times 2$.
\item The output is flattened and fed through a \textit{dense} layer of $128$ neurons with \textit{relu} activation, followed by dropout with deactivation rate of $0.5$, and a final \textit{softmax} regression layer for categorical outputs.
\end{itemize}
We train to minimize the \textit{categorical cross entropy} with respect to the training labels, using the \textit{Adam} optimizer, over $10$ epochs, with a constant learning rate of $10^{-3}$ and with \textit{batch size} of $32$.

The \textbf{variational autoencoder} relies on convolution and deconvolution layers. The encoder is structured as:
\begin{itemize}
\item Two convolutional layers of kernel size $3\times 3$, stride $2$ and \textit{relu} activation; \textit{filter counts} are $32$ and $64$ for the first and second layers. 
\item A \textit{dense} layer of $128$ neurons, with \textit{relu} activation.
\item Two \textit{dense} layers mapping the $128$ neurons to a distributional mean vector and a log-standard-deviation vector, which define the latent distribution for an image. The dimension of the latent space is varied in order to assess robustness, see Appendix \ref{app:robustness} for details.
\item A random \textit{sampling} operation from a normal distribution, with the afore-defined distributional parameters.
\end{itemize}
In addition, the decoder is defined as:
\begin{itemize}
\item A dense layer with \textit{relu} activation, mapping a latent element to a vector of dimensionality $7\times 7\times 64$.
\item Two deconvolutional layers of kernel size $3\times 3$, stride length $2$ and \textit{relu} activation; the \textit{filter counts} are $64$ and $32$ for the first and second layers.
\item An output deconvolutional layer of kernel size $3\times 3$, \textit{filter counts} $1$, stride length $1$ and \textit{sigmoid} activation for pixel values.
\end{itemize}
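
As a quick consistency check on these shapes, and assuming $28\times 28$ inputs with padding chosen so that each stride-$2$ layer exactly halves (or doubles) the spatial resolution, which is an assumption not stated above, the encoder maps resolutions as $28 \rightarrow 14 \rightarrow 7$ and the decoder as $7 \rightarrow 14 \rightarrow 28$. The dense layer in the decoder thus outputs
\begin{equation*}
7 \times 7 \times 64 = 3136
\end{equation*}
values, and the final stride-$1$ layer preserves the $28\times 28$ resolution.
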
The autoencoder is fitted to minimize a custom loss, with a reconstruction term (through a cross-entropy loss) and the Kullback-Leibler divergence between latent mappings and a normal distribution $\mathcal{N}(\boldsymbol{0}, I)$. We use the \textit{Adam} optimizer, over $50$ epochs, with a constant learning rate of $10^{-3}$ and with \textit{batch size} of $32$.
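
For reference, the Kullback-Leibler term admits a standard closed form when latent mappings are diagonal Gaussians, which we assume here. Writing $\boldsymbol{\mu}$ and $\boldsymbol{s}$ for the mean and log-standard-deviation vectors of dimension $d$ (symbols introduced here for illustration),
\begin{align*}
&\text{KL}\big( \mathcal{N}(\boldsymbol{\mu}, \text{diag}(e^{2\boldsymbol{s}})) \,\|\, \mathcal{N}(\boldsymbol{0}, I) \big) \\
&\quad = \frac{1}{2} \sum_{k=1}^d \big( \mu_k^2 + e^{2 s_k} - 2 s_k - 1 \big),
\end{align*}
which vanishes only at $\boldsymbol{\mu}=\boldsymbol{0}$ and $\boldsymbol{s}=\boldsymbol{0}$.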

\subsection{Fashion-MNIST dataset}

The classifier and autoencoder are defined similarly to the above example. However, we add two \textit{dropout} layers (with probability $0.5$), one after each \textit{max-pooling} operation in the classifier. Training proceeds with the \textit{Adam} optimizer, at a constant learning rate of $10^{-3}$ with \textit{batch size} $32$. The classifier is trained for $10$ epochs using the cross-entropy as the cost function. The autoencoder is trained for $50$ epochs using a combination of binary cross-entropy and the Kullback-Leibler divergence as a regularisation term.

\subsection{CelebA dataset}

Images are centred around the face and cropped to size $128 \times 128$, further standardized to pixel values in the range $[0, 1]$. During training, we leverage data augmentation with random rotations (\textit{maximum angle} of $\pm 18$ degrees), random translations by a maximum factor of $0.1$ and random horizontal flips.

The \textbf{classifier} is composed of $6$ convolutional blocks followed by a dense layer with \textit{softmax} activation. Each convolutional block utilizes a \textit{kernel} size of $3$ and \textit{stride} $1$, along with \textit{batch normalization}, \textit{dropout} with deactivation probability of $0.2$, \textit{relu} activation and \textit{max-pooling} (\textit{pool size} 2 and \textit{stride} 2). The number of channels in convolutional layers is, respectively, $32$, $64$, $128$, $128$, $256$ and $256$. The last block is followed by a flattening operation and a \textit{dropout} layer with deactivation probability $0.4$. 

We train this classifier for $5$ epochs using the \textit{Adam} optimizer with batch size $64$ and the \textit{cross-entropy} as cost function. The learning rate is decreased after each epoch by a factor of $0.8$, starting from $10^{-4}$ for the \textit{smiling} and \textit{arched eyebrows} classifiers, and $3\times 10^{-5}$ for the \textit{bags under eyes} classifier.
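
Under the assumption that this schedule is a geometric decay applied once per epoch, the learning rate at epoch $k$ is
\begin{equation*}
\eta_k = \eta_1 \cdot 0.8^{k-1}, \quad k=1,\dots,5,
\end{equation*}
where $\eta_1$ denotes the starting rate (notation introduced here for illustration). The final epoch would thus run at $0.8^4 \approx 0.41$ times the starting rate, i.e. approximately $4.1\times 10^{-5}$ for the \textit{smiling} and \textit{arched eyebrows} classifiers and $1.2\times 10^{-5}$ for the \textit{bags under eyes} classifier.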

The \textbf{encoder} in the variational autoencoder is a series of $5$ convolutional blocks. Each block shares the same structure, with \textit{kernel} size $3$, \textit{stride} $2$, \textit{batch normalization} and \textit{leaky-relu} activation with negative slope coefficient of $0.3$. The number of filters at the output of each block is $32$, $64$, $128$, $256$ and $512$. After the last block we insert a flattening layer and two dense layers, each with $256$ output neurons, for the distributional mapping to the latent space. The \textbf{decoder} is a fully connected dense layer with $8192$ output neurons (reshaped into a $4\times 4\times 512$ activation map) followed by $5$ up-sampling blocks. Each block up-samples the input by a factor of $2$ and feeds it into a convolutional layer with kernel size $3$ and stride $1$, followed by \textit{batch normalization} and \textit{leaky-relu} activation with $0.3$ negative slope coefficient. The number of channels at the output of each block is $256$, $128$, $64$, $32$ and $3$ respectively. We apply an additional convolutional layer with kernel size $3$, stride $1$, $3$ output channels and \textit{sigmoid} activation for a final reconstructed RGB image with values restricted to the $[0, 1]$ interval.
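
As a consistency check, and assuming padding such that each stride-$2$ convolution halves and each up-sampling block doubles the spatial resolution (an assumption not stated above), the encoder maps resolutions as
\begin{equation*}
128 \rightarrow 64 \rightarrow 32 \rightarrow 16 \rightarrow 8 \rightarrow 4,
\end{equation*}
so that the last block outputs a $4\times 4\times 512$ map with $4 \cdot 4 \cdot 512 = 8192$ entries. Conversely, the $5$ up-sampling blocks in the decoder recover $4 \cdot 2^5 = 128$ pixels per side.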

The autoencoder is trained for $100$ epochs using the \textit{Adam} optimizer, with batch size $64$ and a learning rate of $5\times 10^{-4}$ which is decreased after each epoch by a factor of $0.98$. We use a \textit{perceptual loss} function together with the Kullback-Leibler divergence regularisation term, following the details in \citep{Hou7926714} (VAE-123 model).

\subsection{Attribution methods}

We use standard implementations of attribution methods with recommended parameters in corresponding publications or public repositories. In all cases, \textit{black+white} and \textit{counterfactual} variants of methods are implemented equivalently. For path methods requiring trapezoidal integration, we use $50$ bins with grayscale images and $25$ bins with high resolution images. The process to procure counterfactual fiducials is explained in Section \ref{sec:methods}.
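
For concreteness, a trapezoidal approximation over $K$ bins can be sketched as follows, where the uniform grid $\alpha_k = k/K$ and the shorthand $g_i(\alpha)$ for the integrand of the path integral at index $i$ are assumptions introduced here:
\begin{equation*}
\int_0^1 g_i(\alpha) d\alpha \approx \frac{1}{K} \sum_{k=1}^{K} \frac{g_i(\alpha_{k-1}) + g_i(\alpha_k)}{2}.
\end{equation*}
This requires $K+1$ evaluations of the integrand along the path, i.e. $51$ with $K=50$ and $26$ with $K=25$.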

\textbf{Vanilla IG} is implemented with a straight line as domain of integration. 

\textbf{Blur IG} is specified with an integration path which decreases blurring from a masked image, using successive Gaussian filters. The maximum standard deviation is set to the minimum required to maximise the average predictive entropy across train data.

\textbf{Guided IG} is configured s.t. the subset of pixels whose values are updated in each step is the $10\%$ with smallest partial derivatives of entropy with respect to pixel values. We use $50$ steps.

\textbf{LIME} is implemented through \textit{quickshift} segmentation, with kernel $1$, maximum distance $5$ and ratio of $0.2$. We use a binomial mask with deactivation probability $0.2$, and \textit{Lasso} regression to attribute importances.

\textbf{SHAP} proceeds through $2\times\text{(Pixel Count)} + 2^{11}$ index perturbations of varying size; masked index points are re-sampled from their corresponding marginal distributions. We use \textit{Lasso} regression to attribute importances.
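
As a worked example, and assuming that Pixel Count refers to the number of pixels in a single-channel image, a $28\times 28$ grayscale image yields
\begin{equation*}
2 \times 784 + 2^{11} = 1568 + 2048 = 3616
\end{equation*}
index perturbations.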

\textbf{CLUE} attributions are derived as the difference between an image and its decoded CLUE counterpart \citep[cf.][Appendix F]{antoran2021getting}. The cost function weighs reconstruction and uncertainty terms, and is tuned on a validation set. 

\textbf{XRAI} is implemented with Felzenszwalb's segmentation algorithm in order to retrieve masks. We use multiple scale values of $50, 100, 150, 250, 500$ and $1200$, as well as a dilation radius of $5$. This is applied to images normalised to the range $[-1, 1]$ and resized to $224\times 224$ pixels. Resizing is undertaken with anti-aliasing. Segments are accepted for appending into attributions with a required difference of $50$ pixels.

\bibliography{perez_396.bib}

\end{document}
