\documentclass{midl}

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{tikz}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{caption}

\usepackage{mwe} % to get dummy images
\jmlrvolume{-- 239}
\jmlryear{2025}
\jmlrworkshop{Full Paper -- MIDL 2025 submission}
\editors{Accepted for publication at MIDL 2025}

\title[]{Feature Attribution for Deep Learning Models\\ through Total Variance Decomposition}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor
{\Name{Yinzhu Jin\nametag{$^{1}$}} \orcid{0009-0008-8904-446X} \Email{yj3cz@virginia.edu}\\
\addr $^{1}$ Dept. of Computer Science, University of Virginia, Charlottesville, VA, USA \\
\Name{Shen Zhu\nametag{$^{1, 2}$}} \orcid{0009-0001-3978-0687} \Email{sz9jt@virginia.edu}\\
\addr $^{2}$ Dept. of Electrical \& Computer Engineering, University of Virginia, Charlottesville, VA, USA \\
\Name{P. Thomas Fletcher\nametag{$^{1, 2}$}} \orcid{0000-0003-3417-2380} \Email{ptf8v@virginia.edu}
}
\begin{document}

\maketitle

\begin{abstract}
This paper introduces a new approach to feature attribution for deep learning models, quantifying the importance of specific features in model decisions. By decomposing the total variance of model decisions into explained and unexplained fractions, conditioned on the target feature, we define the feature attribution score as the proportion of explained variance. This method offers a solid statistical foundation and normalized quantitative results. When ample data is available, we compute the score directly from test data. For scarce data, we use constrained sampling with generative diffusion models to represent the conditional distribution at a given feature value. We demonstrate the method’s effectiveness on both a synthetic image dataset with known ground truth and OASIS-3 brain MRIs.
\end{abstract}

\begin{keywords}
Feature attribution, counterfactual explanation, generative diffusion model.
\end{keywords}

\section{Introduction}
\label{sec:intro}
Deep learning has achieved remarkable performance in image classification by leveraging complex neural network architectures to automatically extract and learn features. However, despite its success, deep learning often operates as a ``black box,'' where the internal workings and decision-making processes of the models are challenging to interpret. 
Numerous methods have been proposed to interpret model decisions, with saliency maps~\cite{selvaraju2017grad, sundararajan2017axiomatic,schulzrestricting} and counterfactual explanations (CE)~\cite{wachter2017counterfactual, goyal2019counterfactual, augustin2022diffusion} emerging as two of the most widely used techniques. Both strategies are well suited to discovering important features without any background knowledge. Nevertheless, for medical imaging tasks involving significant human expertise, explanations that rely on understandable feature contributions are preferred. Moreover, techniques like CE, although designed to mimic human reasoning, do not appear to enhance trust in the system's predictions~\cite{wang2021explanations}.

We propose a metric to quantify a classifier's reliance on a specific target feature by decomposing the variance of model predictions. Our score reflects the proportion of prediction variance explained by the feature, based on its conditional distribution. Unlike saliency maps, which assign scores to image positions, our method applies to any feature with a learnable distribution. For example, in classification of Alzheimer's disease from brain MRI, our method can evaluate the importance of both the hippocampal voxels themselves and the overall hippocampal volume. While large datasets often allow direct sampling from the data distribution conditioned on discrete features, direct sampling conditioned on a continuous feature is generally not possible. To address this, we use diffusion models~\cite{ho2020denoising} and guided sampling~\cite{chung2023diffusion} to model the conditional distribution.

In summary, our proposed evaluation metrics offer the following advantages:
\begin{itemize}
    \item Quantified importance evaluation: provides a measurable assessment of feature importance, which is particularly useful in tasks requiring human expertise.
    \item Broad applicability: applicable to any learnable or annotated features, both continuous and discrete, without relying on classifier robustness.
    \item Rooted in causality: based on principles of causality theory by observing outcomes when intervening on specific target features.
%    \item  Independence from classifier robustness: does not require the evaluated classifier to be robust, or need another robust classifier for support. 
\end{itemize}

\section{Background}
We first introduce related interpretability techniques, then the diffusion model used for learning and sampling from conditional distributions.


\subsection{Causality based interpretation}
Counterfactual explanations provide intuitive insights by generating a new sample that flips the model’s decision with minimal changes to the original image. Early methods composited features from distractor images~\cite{goyal2019counterfactual}, while recent approaches use generative models like diffusion models~\cite{ho2020denoising} for better image quality. These methods often rely on classifier gradients to minimize distance~\cite{augustin2022diffusion, jeanneret2022diffusion}, but when applied to non-robust classifiers, they may generate adversarial examples. While \citet{augustin2022diffusion} can evaluate non-robust classifiers, they still rely on a robust classifier to mitigate this issue.

Other works interpret classifiers using features beyond pixels. For example, CaCE~\cite{goyal2019explaining} examines model predictions by varying feature values. Their metric calculates the difference in model outputs when a binary feature is set to negative versus positive. This approach is inherently limited to binary features.
In contrast, \citet{jin2024measuring} proposed attributing continuous features by neutralizing their influence, which they achieve by adjusting feature values to a baseline. However, the choice of this baseline value is not well justified. Both methods rely on variational autoencoders (VAEs)~\cite{kingma2013auto}, thus missing out on the advancements offered by state-of-the-art generative models.

\subsection{Diffusion models and guided sampling}\label{sec:backgrounds}
In a diffusion model~\cite{sohl2015deep, ho2020denoising, song2020score}, the forward process is a Markov process where Gaussian noise is added gradually to the original data $x_0$. At each time step $t$, $x_t$ is sampled from the distribution:
\begin{equation}
    q(x_t \mid x_{t-1}) := \mathcal{N}(\sqrt{1-\beta_t}x_{t-1},\beta_t \textbf{I}).
\end{equation}
The time-dependent variance schedule $\{\beta_t\}$ ensures that the final distribution is approximately a standard Gaussian:
$
    q(x_T \mid x_0) \approx \mathcal{N}(\textbf{0},\textbf{I}).
$
In the reverse process, the goal is to learn $p_\theta(x_{t-1} \mid x_t)$, the distribution of $x_{t-1}$ given $x_t$, parameterized by $\theta$.


In DDPM~\cite{ho2020denoising}, the problem is simplified to predicting the added noise $\epsilon$ based on $x_t$ and $t$, formulated as $\epsilon_\theta (x_t, t)$. Alternatively, this can be viewed as a score-based model with the score function:
\begin{equation}
    \nabla_{x_t} \log p_\theta(x_t) = - \frac{1}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t),
\end{equation}
where $\bar{\alpha}_t := \prod_{i=1}^{t} \alpha_i$ and $\alpha_i := 1-\beta_i$~\cite{dhariwal2021diffusion, song2020score}.
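
As a brief sanity check of this identity (a sketch that conditions on $x_0$, which the learned score only approximates), composing the forward steps gives the closed-form marginal $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)\textbf{I})$, so that $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(\textbf{0},\textbf{I})$. Differentiating the Gaussian log-density then yields
\begin{equation*}
    \nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t} = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon,
\end{equation*}
which has the same form as the score function above once $\epsilon$ is replaced by its prediction $\epsilon_\theta(x_t, t)$.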

To draw samples from a conditional distribution given condition $c$, we look into the conditional score function
$
    \nabla_{x_t} \log p_\theta (x_t \mid c) = \nabla_{x_t} \log p_\theta (x_t) + \nabla_{x_t} \log p_\theta (c \mid x_t).
$
The second term could be the gradient of the classifier that predicts $c$~\cite{dhariwal2021diffusion}.
The diffusion posterior sampling (DPS)~\cite{chung2023diffusion} method generalizes this to continuous conditions by replacing the classifier-based $p_\theta (c \mid x_t)$ with a hypothetical Gaussian distribution:
\begin{equation}
\label{eq:dps}
    \nabla_{x_t} \log p_\theta (c=c_0 \mid x_t) = - \rho \nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2,
\end{equation}
with $\rho$ being a constant coefficient, $c_0$ being the given condition value, $\hat{x}_0$ being the expected $x_0$ given $x_t$, and $g$ being the mapping from the data to the feature.
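
To see where the squared norm comes from, consider the following sketch, in which the isotropic standard deviation $s$ is an illustrative symbol of our own rather than a quantity used in the experiments. If the hypothetical likelihood is $p(c \mid x_t) = \mathcal{N}(g(\hat{x}_0), s^2\textbf{I})$ with constant $s$, then
\begin{equation*}
    \log p(c = c_0 \mid x_t) = -\frac{1}{2s^2}\,\| c_0 - g(\hat{x}_0) \|_2^2 + \text{const},
    \qquad
    \nabla_{x_t} \log p(c = c_0 \mid x_t) = -\frac{1}{2s^2}\,\nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2 ,
\end{equation*}
so Equation \ref{eq:dps} corresponds to $\rho = 1/(2s^2)$. Under this reading, a larger $\rho$ encodes a tighter tolerance around $c_0$; in practice $\rho$ is typically treated as a tunable step size.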


\section{Methods}
Our goal is to assess a classifier $f$ which is a mapping from input data $X \sim p(X)$ to $\{0,1\}$. We denote the classifier prediction as $Y:=f(X)$, which can be seen as a random variable. Similarly, we define a feature as the output of a mapping $g$ from the input data to some feature value $V:=g(X)$, which is another random variable. In our framework, the feature type can be very general, e.g., $V$ may be discrete, continuous, or multivariate.

\subsection{Causal model}
We represent the causal relationship of the variables involved in our analysis in Figure \ref{causal_structure}. The variable $V$ represents the target feature, and $Y$ represents the model prediction as defined above. Additionally, $W$ represents the exogenous variables besides $V$ that are present in the data and may affect the model prediction.

\begin{figure}[htbp]
\centering
\begin{tikzpicture}
    % Nodes
    \node (V) at (0,1) [double, circle, draw] {\textit{V}};
    \node (W) at (2,1) [circle, draw] {\textit{W}};
    \node (Y) at (1,0) [circle, draw] {\textit{Y}};

    % Edges
    \draw[->] (V) -- (Y);
    \draw[->] (W) -- (Y);
\end{tikzpicture}
\caption{The structural causal model. The double circle represents the variable being set.}
\label{causal_structure}
\end{figure}

Ideally, we would fix $W$ and observe how $Y$ changes with respect to different values of the feature of interest, $V$. In practice, however, $W$ is difficult to define, not tractable, and hard to control precisely. While one might try to fix $W$ by modifying the data along the direction of $\nabla g$, completely disentangling features remains challenging. We illustrate this with an example using the OASIS-3 dataset from Section \ref{sec:data}, consisting of 3D brain MRIs cropped around the hippocampus. The feature of interest, hippocampal volume, is estimated using a CNN regression model, $g$. We applied a DDIM encoder with guided denoising using the gradient of the trained regressor to generate a series of samples differing only in hippocampal volume. As shown in Figure \ref{ddim_series}, while the hippocampal volume changes as intended, surrounding structures also change.
%This might be due to the correlation existing in the dataset.

\begin{figure}[htbp]
\floatconts
  {ddim_series}
  {\caption{Right: samples of DDIM encoding and guided sampling for enlarging the hippocampus. Left: an illustration of the hippocampus, outlined in red.}}
  {\includegraphics[width=0.8\linewidth]{ddim_series.png}}
\end{figure}


On the other hand, we can measure $V$ much more precisely if we have a good estimate of $g$. In such a case, it is more feasible to fix $V$ up to a measurable amount of error and perturb $W$ randomly. With this strategy, we now design a quantitative score that measures the effect of $V$ on $Y$.

\subsection{Feature importance score}
To design our feature importance score, recall a basic theorem in probability theory where the variance is decomposed into two parts using the conditional distribution:
\begin{theorem}[Law of total variance~\cite{fox2015applied}]
If $A$ and $B$ are random variables on the same probability space, and the variance of $A$ is finite, then
\begin{equation}
\label{variance_decomp}
\textnormal{Var}(A) = \textnormal{E}[\textnormal{Var}(A \mid B)] + \textnormal{Var}[\textnormal{E}(A \mid B)].
\end{equation}
\end{theorem}
The two terms on the right-hand side are often known as the ``unexplained'' and the ``explained'' components of the variance, respectively.

We now decompose the variance of the classifier prediction $Y$ using the distribution conditioned on the target feature $V$.
%The decomposition can reflect what portion of the variance can be explained by $V$:
%\begin{equation}
%\label{Y_variance_decomp}
%\textnormal{Var}(Y) = \textnormal{E}[\textnormal{Var}(Y \mid V)] + \textnormal{Var}[\textnormal{E}(Y \mid V)].
%\end{equation}
We define our feature importance score over the dataset (namely, global score) as the fraction of explained variance:
\begin{equation}
\text{Score}_V :=
\frac{\textnormal{Var}[\textnormal{E}(Y \mid V)]}{\textnormal{Var}(Y)} =  1- \frac{\textnormal{E}[\textnormal{Var}(Y \mid V)]}{\textnormal{Var}(Y)}.
\end{equation}
Provided that $\textnormal{Var}(Y) > 0$, the proposed score always lies within the interval $[0,1]$, because both terms on the right-hand side of Equation \ref{variance_decomp} are non-negative and sum to $\textnormal{Var}(Y)$.
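
As a minimal worked example with hypothetical numbers (not taken from our experiments), let $V \in \{0,1\}$ with $P(V=1) = 0.5$, and suppose $P(Y=1 \mid V=0) = 0.1$ and $P(Y=1 \mid V=1) = 0.9$. Since $Y$ is binary, $\textnormal{E}(Y) = 0.5$ and $\textnormal{Var}(Y) = 0.5 \times 0.5 = 0.25$, while $\textnormal{Var}(Y \mid V=v) = 0.1 \times 0.9 = 0.09$ for both values of $v$. Hence
\begin{equation*}
    \textnormal{Var}[\textnormal{E}(Y \mid V)] = 0.5\,(0.1-0.5)^2 + 0.5\,(0.9-0.5)^2 = 0.16,
    \qquad
    \text{Score}_V = \frac{0.16}{0.25} = 1 - \frac{0.09}{0.25} = 0.64,
\end{equation*}
and the two forms of the definition agree.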

To intuitively understand how this score reflects the importance of a target feature, consider two extreme scenarios. First, if $Y$ is independent of $V$, then the conditional distribution $P(Y \mid V)$ is the same as $P(Y)$, which means $\textnormal{Var}(Y \mid V) = \textnormal{Var}(Y)$. In this case, the score $\textnormal{Score}_V$ will be $0$. Conversely, if $Y$ is fully determined by $V$, then $\textnormal{Var}(Y \mid V) = 0$, resulting in $\textnormal{Score}_V = 1$.

We further extend the feature importance score by introducing a score for each point $V=v_0$ (namely, local score):
\begin{equation}
    \text{Score}_{V=v_0} := 
      1- \frac{\textnormal{Var}(Y \mid V=v_0)}{\textnormal{Var}(Y)}.
\end{equation}
This extension is consistent with our global score, as shown in the following equation:
\begin{equation}
    \textnormal{E}_{v_0}[\text{Score}_{V=v_0}]
    = 1-  \frac{\textnormal{E}_{v_0}\left[ \textnormal{Var}(Y \mid V=v_0) \right]}{\textnormal{Var}(Y)}
    = \text{Score}_V.
\end{equation}
The local score indicates the importance of a feature when it takes a specific value. If the feature $V$ is informative about $Y$ at $v_0$, the conditional variance of $Y$ should decrease, leading to a higher local score. This local score is similar to the $R^2$ score~\cite{pearson1901liii} widely used in regression analysis. Like the $R^2$ score, it has an upper bound of $1$, but unlike the $R^2$ score, it can also fall below $0$. A negative local score indicates that when $V$ takes this particular value, the variability in the classifier's predictions is greater than the variability over the whole dataset.
%, i.e., $\textnormal{E}_{v_0}[ \textnormal{Var}(Y \mid V=v_0)]>\textnormal{Var}(Y)$.
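
A small hypothetical example (with numbers chosen only for illustration) shows how a negative local score arises. Let $V \in \{0,1\}$ with $P(V=1) = 0.2$, $P(Y=1 \mid V=0) = 0$, and $P(Y=1 \mid V=1) = 0.5$. Then $\textnormal{E}(Y) = 0.1$ and $\textnormal{Var}(Y) = 0.1 \times 0.9 = 0.09$, while $\textnormal{Var}(Y \mid V=0) = 0$ and $\textnormal{Var}(Y \mid V=1) = 0.25$, so
\begin{equation*}
    \text{Score}_{V=0} = 1,
    \qquad
    \text{Score}_{V=1} = 1 - \frac{0.25}{0.09} \approx -1.78,
    \qquad
    \textnormal{E}_{v_0}[\text{Score}_{V=v_0}] \approx 0.8 \times 1 + 0.2 \times (-1.78) \approx 0.44 .
\end{equation*}
The same global value follows from $\textnormal{Var}[\textnormal{E}(Y \mid V)] / \textnormal{Var}(Y) = 0.04 / 0.09 \approx 0.44$: the feature is informative overall, even though predictions at $V=1$ are more variable than over the whole dataset.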

\subsection{Sampling from conditional distribution}
To calculate the proposed score, we need to sample from the conditional distribution $P(Y \mid V=v_0)$. Ideally, we would like to have enough real samples to represent this distribution at every $v_0$. When $V$ is a categorical feature and we are given a large enough test set, this is easily satisfied. However, for continuous $V$, we might not have enough samples with $V\in (v_0-\epsilon, v_0+\epsilon)$ for some small $\epsilon>0$. In this scenario, we propose to use a generative model to obtain new samples. Although there is no restriction on the type of generative model, we adopt diffusion models~\cite{ho2020denoising} for their high sample quality. As introduced in Section \ref{sec:backgrounds}, we use a guided sampling method to constrain $V$.
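
One way to turn such samples into a score, given here as a plug-in sketch rather than as the exact estimator used in our experiments, exploits the fact that $Y \in \{0,1\}$. Suppose the feature values are grouped into $K$ conditions $v_1, \dots, v_K$, let $\hat{p}_k$ be the fraction of samples at $v_k$ that the classifier predicts as $1$, and let $w_k$ be the estimated probability of condition $v_k$, for example $n_k / \sum_j n_j$ when $n_k$ test samples fall into condition $k$ (the symbols $K$, $n_k$, $\hat{p}_k$, and $w_k$ are introduced only for this illustration). Because a Bernoulli variable with mean $p$ has variance $p(1-p)$,
\begin{equation*}
    \widehat{\textnormal{Var}}(Y \mid V=v_k) = \hat{p}_k(1-\hat{p}_k),
    \qquad
    \hat{p} = \sum_{k=1}^{K} w_k \hat{p}_k,
    \qquad
    \widehat{\text{Score}}_V = 1 - \frac{\sum_{k=1}^{K} w_k\, \hat{p}_k(1-\hat{p}_k)}{\hat{p}(1-\hat{p})} .
\end{equation*}
The local score at $v_k$ is obtained by replacing the weighted sum in the numerator with the single term $\hat{p}_k(1-\hat{p}_k)$.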

When performing guided sampling, following the unconditional score $\nabla_{x_t} \log p_\theta (x_t)$ is equivalent to performing the usual DDPM denoising step, while Equation \ref{eq:dps} provides the extra drift term of the DPS method~\cite{chung2023diffusion}. The feature mapping $g$ can be a regression model trained on the annotated training set. We also experimented with normalizing the gradient $\nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2$, similar to previous work that applied normalization to classifier guidance~\cite{augustin2022diffusion}. The normalization helps stabilize the sampling.

In summary, at each time step $t$, the guided denoising operation is:
\begin{equation}
    \begin{aligned}
    &\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(x_t,t)\right),\quad z \sim \mathcal{N}(\textbf{0},\textbf{I}),\\
    &x_{t-1}' =  \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t +\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\hat{x}_0 + \sigma_t z,\\
    &x_{t-1} = x_{t-1}' - \rho \frac{\nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2}{\|\nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2\|_2}.
    \end{aligned}
\end{equation}
The first two lines are regular DDPM denoising with noise standard deviation $\sigma_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t}$, and the last line guides the samples closer to the constraint. Due to the stochasticity of diffusion models, a small portion of samples will still fall far from the constraint. We remove these samples by checking the resulting value of $V$ using $g$. Note that we perform guided sampling using the feature mapping rather than the evaluated classifier, avoiding the risk of generating adversarial samples when the classifier is non-robust.
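
A useful consequence of the normalization, noted here as a short sketch under the assumption that the gradient is nonzero, is that the size of the guidance correction no longer depends on the gradient magnitude:
\begin{equation*}
    \| x_{t-1} - x_{t-1}' \|_2
    = \rho\, \frac{\left\| \nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2 \right\|_2}{\left\| \nabla_{x_t} \| c_0 - g(\hat{x}_0) \|_2^2 \right\|_2}
    = \rho .
\end{equation*}
The coefficient $\rho$ therefore acts directly as a per-step displacement in image space, which is one way to understand the stabilizing effect of the normalization.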

\section{Experiments}
We evaluate our proposed metric on one synthetic image dataset and one medical image dataset, and compare it with the diffusion model-based counterfactual explanation method~\cite{augustin2022diffusion} (``diffusion CE'' for short). Both methods use the same diffusion models, but unlike their approach, our experiments use classifiers not specifically trained for robustness. All experiments are implemented with PyTorch~\cite{paszke2017automatic}.

\subsection{Datasets}
\label{sec:data}
%remove CelebA


\noindent\textbf{Ellipse dataset} is a synthetic dataset generated using the package by \citet{ellipse}. It contains 10,000 images of white ellipses on black backgrounds, varying in position, orientation, size, and aspect ratio. The images are labeled into two categories based on the ellipses' aspect ratio. We use the aspect ratio and the size (an irrelevant feature) as our target features.

\noindent\textbf{Brain MRI ROI data} is from the OASIS-3 dataset~\cite{lamontagne2019oasis}, consisting of 929 subjects diagnosed as cognitively normal (CN) or with Alzheimer’s Disease (AD), each with one MRI session. The data was stratified, with 186 subjects for testing. For each subject, we extracted two $64 \times 64 \times 64$ regions of interest (ROIs) centered on the hippocampi, mirroring the right hippocampus along the sagittal plane. The dataset is imbalanced, with 77.5\% CN samples. We focus on conditioning with fixed hippocampal voxels or hippocampal volume, as both are known to relate to AD~\cite{sarica2018mri, zhu2024quantifying}.





\subsection{Results}
We now present and discuss the scores obtained using our method.
Details on the models, training, and sampling process, together with more samples, are provided in the appendix.

\subsubsection{Ellipse}
Since the ellipse dataset is synthetic, we know that the only important feature is the aspect ratio.
Figure \ref{fig:ellipse_sample} shows samples generated by constraining either the aspect ratio or the size.
We computed our scores for the aspect ratio and size features. The global scores reported in Table \ref{tab:ellipse} confirm that our metric reflects this ground truth: the aspect ratio receives a high score, while the size receives a score close to zero. Local scores are reported in Figure \ref{fig:ellipse_local}.
%We can further look into the local scores from the scatter plot given in Figure \hyperref[fig:ar_local]{6(a)}. With comparison to the actual aspect ratio value distribution (as shown in Figure \hyperref[fig:ar_hist]{6(b)}), we can see that the local value is small when there is larger overlap of two classes. In other words, when the aspect ratio value falls in this overlapped region, the classifier predictions varies a lot.


\begin{figure}[htbp]
\begin{minipage}{0.45\textwidth}
  \captionof{table}{Global scores of ellipse classifier.}
  \floatconts
  {tab:ellipse}
  {}
  {
  \begin{tabular}{@{}lc@{}}
    \hline
    Feature & Global score\\
    \hline
    Aspect ratio & $0.871$\\
    Size & $0.049$ \\
    \hline
  \end{tabular}
  }
\end{minipage}\hfill
\begin{minipage}{0.53\textwidth}
    \floatconts
    {fig:ellipse_sample}
    {\caption{Ellipse samples generated with constrained aspect ratios (top) and sizes (bottom).}}
    {\includegraphics[width=0.9\textwidth]{ellipse.png}}
\end{minipage}
\end{figure}


\begin{figure}[tbh!]
    \floatconts
    {fig:ce_ellipse}
    {\caption{Sample pairs of original ellipse images and their diffusion CE.}}
    {\includegraphics[width=.56\textwidth]{ce_ellipse.png}}
\end{figure}

The diffusion CE flipped the model prediction for $82.5\%$ of samples. Some original and counterfactual image pairs with flipped predictions are shown in Figure \ref{fig:ce_ellipse}. While aspect ratios generally change, size, a feature known to be unimportant, sometimes changes as well (see the second column). This mirrors the findings in Figure \ref{ddim_series}, highlighting the difficulty of changing an image in the gradient direction without affecting other dimensions.

\begin{figure}[t]
  \hspace{40pt}
  \begin{minipage}{0.37\textwidth}
      \floatconts
      {fig:ar_local}
      {}
      {\includegraphics[width=\textwidth]{ar_scatter.pdf}}
  \end{minipage}\hspace{30pt}
  \begin{minipage}{0.37\textwidth}
      \floatconts
      {fig:ar_hist}
      {}
      {\includegraphics[width=\textwidth]{ar_hist.pdf}}
  \end{minipage}
  \caption{Local scores of the aspect ratio feature in the ellipse data (left) and the distribution of the log aspect ratio values in the training data (right).}\label{fig:ellipse_local}
\end{figure}


\subsubsection{Brain ROI}

For the brain ROI dataset, hippocampal volume is constrained using the trained regression model, while the hippocampus is constrained by masking areas outside the hippocampus using the ground truth segmentation, akin to an inpainting task. Random samples with constrained hippocampus are shown in Figure \ref{fig:hippo_inpaint}, where the hippocampus (in red) closely matches the original, while surrounding brain areas vary. More examples are in Figure \ref{fig:more_inpaint}.

\begin{figure}[htbp]
  \begin{minipage}{0.53\textwidth}
    \floatconts
    {fig:hippo_inpaint}
    {\caption{Original image (leftmost) with the hippocampus shown in red, and randomly generated samples with constrained hippocampus.}}
    {\includegraphics[width=\textwidth]{hippo_inpaint.png}}
  \end{minipage}\hfill
  \begin{minipage}{0.42\textwidth}
  \captionof{table}{Global scores of AD classifier.}
  \floatconts
  {tab:hippo}
  {}
  {
  \begin{tabular}{@{}lc@{}}
    \hline
    Feature & Global score\\
    \hline
    Hippocampal volume & $0.280$\\
    Hippocampus & $0.448$ \\
    \hline
  \end{tabular}
  }
  \end{minipage}
\end{figure}

Global scores are reported in Table \ref{tab:hippo}, and local scores for corresponding hippocampal volumes are shown on the left side of Figure \ref{fig:hippo_local}. The global score was computed by averaging the separately calculated expectations for the AD and CN groups, addressing class imbalance, similar to balanced accuracy. These scores suggest that hippocampal volume is a useful classifier feature, with the entire hippocampus explaining more variation in outputs, while the remainder may reflect other factors, like ventricle volume. The local scores are higher when the volume is notably small or large, as expected, since this strongly indicates AD or CN, as shown in the per-class volume distributions on the right side.
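
One possible formalization of this class-balanced averaging, stated as our reading of the procedure rather than as a verbatim specification, is as follows. Writing $D \in \{\text{AD}, \text{CN}\}$ for the diagnosis of the subject that supplies the condition value $v_0$ (the symbol $D$ is introduced only here), the expectation over $v_0$ is replaced by
\begin{equation*}
    \textnormal{E}^{\text{bal}}_{v_0}\left[ \textnormal{Var}(Y \mid V=v_0) \right]
    = \frac{1}{2} \Big( \textnormal{E}_{v_0 \mid D=\text{AD}}\left[ \textnormal{Var}(Y \mid V=v_0) \right]
    + \textnormal{E}_{v_0 \mid D=\text{CN}}\left[ \textnormal{Var}(Y \mid V=v_0) \right] \Big),
\end{equation*}
so that each diagnostic group contributes equally despite the $77.5\%$ share of CN samples.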

\begin{figure}[t]
    \floatconts
    {fig:ce_hippo}
    {\caption{Sample pairs of original brain ROI images and their diffusion CE. The first two columns show samples flipping from CN to AD, and the last two from AD to CN.}}
    {\includegraphics[width=.56\textwidth]{ce_hippo.png}}
\end{figure}


\begin{figure}[t]
  \hspace{40pt}
  \begin{minipage}{0.37\textwidth}
      \floatconts
      {fig:vol_local}
      {}
      {\includegraphics[width=\textwidth]{hippo_vol_scatter.pdf}}
  \end{minipage}\hspace{30pt}
  \begin{minipage}{0.37\textwidth}
      \floatconts
      {fig:vol_hist}
      {}
      {\includegraphics[width=\textwidth]{hippo_vol_hist.pdf}}
  \end{minipage}
  \caption{Local scores for the hippocampal volume feature in the brain ROI dataset (left) and the distribution of hippocampal volumes in the training data (right).}\label{fig:hippo_local}
\end{figure}


Diffusion CE successfully flipped model decisions for all test samples. We attribute this to the VAE-enhanced classifier guidance, as discussed in Appendix \ref{sec:vae_latent}. Sample pairs are shown in Figure \ref{fig:ce_hippo}, where changes in hippocampal and ventricle volumes are visible, but subtle. Thus, the CE method aligns with our findings: while CE offers an intuitive explanation, our metric provides a quantitative assessment.

\subsection{Sampling evaluation}
We assess the effectiveness of constrained sampling methods by evaluating data coverage and estimating the sampling variances of our scores.

\begin{table}[htbp]
  \caption{Coverage score of different sampling methods. }
  \label{tab:coverage}
  \begin{minipage}{0.45\textwidth}
    \floatconts
    {}
    {\centering (a) The ellipse data}
    {
      \begin{tabular}{@{}lccc@{}}
        \hline
        \multirow{2}{*}{Sampling method} & \multicolumn{3}{c}{Coverage}\\
         & $k=5$ & $k=3$ & $k=1$ \\
        \hline
        Plain DDPM & $0.978$ & $0.972$ & $0.887$ \\
        Constrained aspect ratio & $0.991$ & $0.975$ & $0.887$\\
        Constrained size & $0.991$ & $0.978$ & $0.905$ \\
        \hline
      \end{tabular}
    }
  \end{minipage}\hfill
  \begin{minipage}{0.45\textwidth}
    \floatconts
    {}
    {\centering (b) The brain MRI ROI data}
    {
      \begin{tabular}{@{}lcc@{}}
        \hline
        \multirow{2}{*}{Sampling method} & \multicolumn{2}{c}{Coverage}\\
         & $k=3$ & $k=1$ \\
        \hline
        Plain DDPM & $1.000$ & $0.994$ \\
        Constrained hippo. vol. & $1.000$ & $1.000$  \\
        Constrained hippo. & $1.000$  &  $1.000$ \\
        \hline
      \end{tabular}
    }
  \end{minipage}
\end{table}

To evaluate sample coverage of the real data distribution, we combine samples from different conditions and compare them to the ground truth test set. As a baseline, we use plain DDPM sampling. The coverage metric from \citet{naeem2020reliable} measures the fraction of real samples with generated samples in their $k$-nearest neighborhood using $L_2$ distance in the embedding space. We used a VGG16~\cite{simonyan2015very} encoder for the ellipse dataset and our own VAE encoder for the brain ROI dataset. Results in Table \ref{tab:coverage} show our method slightly outperforms plain DDPM sampling, likely due to real data-guided conditions. We conclude our sample distribution covers the real distribution well.
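
For reference, the coverage of \citet{naeem2020reliable} can be written as follows, where $\{r_i\}_{i=1}^{N}$ and $\{s_j\}_{j=1}^{M}$ denote the embedded real and generated samples (notation introduced here for illustration) and $\text{NND}_k(r_i)$ is the distance from $r_i$ to its $k$-th nearest neighbor among the other real samples:
\begin{equation*}
    \text{coverage} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, \exists\, j \ \text{such that}\ \| s_j - r_i \|_2 \le \text{NND}_k(r_i) \,\right].
\end{equation*}
A value close to $1$ means that nearly every real sample has at least one generated sample within its $k$-nearest-neighbor ball. A smaller $k$ shrinks the balls and makes the criterion stricter, which is consistent with the $k=1$ values in Table \ref{tab:coverage} never exceeding those for larger $k$.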


\begin{table}[htbp]
   \caption{Estimated variance of our scores using the bootstrap method.}
  \label{tab:bootstrap}
  \begin{minipage}{0.46\textwidth}
      \floatconts
      {}
      {\centering (a) The ellipse data}
      {
        \begin{tabular}{@{}lcc@{}}
          \hline
          Feature & Local score & Global score\\
          \hline
          Aspect ratio & $6.74 \times 10^{-4}$ & $3.40 \times 10^{-7}$ \\
          Size & $9.85 \times 10^{-4}$ & $5.00 \times 10^{-7}$ \\
          \hline
        \end{tabular}
      }
  \end{minipage}\hspace{20pt}
  \begin{minipage}{0.46\textwidth}
      \floatconts
      {}
      {\centering (b) The brain MRI ROI data}
      {
        \begin{tabular}{@{}lcc@{}}
          \hline
          Feature & Local score & Global score\\
          \hline
          Hippo. vol. & $3.30 \times 10^{-3}$ & $2.75 \times 10^{-5}$ \\
          Hippocampus & $3.20 \times 10^{-3}$ & $2.41 \times 10^{-5}$ \\
          \hline
        \end{tabular}
      }
  \end{minipage}
\end{table}

Given that the data distribution is well covered, we perform bootstrapping tests~\cite{efron1992bootstrap} to estimate score variation due to sampling. This involves resampling the existing samples 5,000 times, calculating a new score for each, and computing the variance of the resulting scores. The variances for local and global scores are reported in Table \ref{tab:bootstrap}. For local scores, we compute the mean variance across feature values. The small variance suggests that our score remains informative even with relatively small sample sizes.
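
The procedure can be summarized by the following sketch, in which $B = 5{,}000$ is the number of resamples and $S^{*}_b$ denotes the score recomputed on the $b$-th resample drawn with replacement from the existing samples (the symbols $B$, $S^{*}_b$, and $\bar{S}^{*}$ are ours, and the unbiased $1/(B-1)$ normalization is an assumption):
\begin{equation*}
    \bar{S}^{*} = \frac{1}{B} \sum_{b=1}^{B} S^{*}_b,
    \qquad
    \widehat{\textnormal{Var}}(S) = \frac{1}{B-1} \sum_{b=1}^{B} \left( S^{*}_b - \bar{S}^{*} \right)^2 .
\end{equation*}
For example, the global-score variance of $3.40 \times 10^{-7}$ for the aspect ratio corresponds to a bootstrap standard deviation of about $5.8 \times 10^{-4}$, which is small relative to the score of $0.871$.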

\section{Conclusion and Discussion}
Our proposed metrics are designed to assess the extent to which a classifier depends on a well-known, meaningful feature, either across the entire dataset or at specific feature values.
The results show that they effectively quantify feature importance for classifiers, offering a normalized range for both local and global assessments.
Furthermore, our metrics could be applied to raw imaging data, using foundational image generative models, as long as there is either a closed-form or deep learning-based mapping to the feature.

One limitation is the computational cost of generating samples with the diffusion model ($\sim$30 hours on a single A100 GPU for the brain ROI data), but this trade-off can be managed by checking the sampling variance of the scores, as shown in our experiments. Another limitation is that our metrics do not extend to the subject level, which remains a direction for future work.




\midlacknowledgments{This work was partially supported by NSF Smart and Connected Health grant 2205417.}


\bibliography{midl25_239}


\appendix

\section{Experiments on CelebA dataset}
In this section, we present experiments on the CelebA dataset with binary features, comparing our method to CaCE~\cite{goyal2019explaining}. This demonstrates the applicability of our approach to natural images, particularly when data is abundant and additional sampling is unnecessary.

\subsection{Setup}
CelebFaces Attributes (CelebA)~\cite{liu2015faceattributes} is a publicly available dataset of celebrity face photos annotated with multiple binary attributes. We cropped the images into squares and resized them to $128\times128$ pixels. We focus on the gender classification task.
Among the binary annotations in the original dataset, we chose as target features several that are clearly related or unrelated to gender (listed in Table \ref{tab:celeba}).

We trained a residual network~\cite{he2016deep}, as implemented by \citet{rw2019timm}, for the gender classification.
Detailed architectural information for the specific variant we used can be found at \url{https://huggingface.co/timm/resnet10t.c3_in1k}.
We opted not to use pre-trained weights as they did not improve classifier performance. Since the dataset is large enough and the features are binary, it was not necessary to train a generative model.

\subsection{Results}

\begin{table}[htbp]
  \floatconts
  {tab:celeba}
  {\caption{Our proposed score and CaCE score for binary features on CelebA gender classifier.}}
  {
  \begin{tabular}{@{}lcccc@{}}
    \hline
    \multirow{2}{*}{Feature} & \multirow{2}{*}{Global (ours)} & \multicolumn{2}{c}{Local (ours)} & \multirow{2}{*}{CaCE}\\
    & & negative & positive\\
    \hline
    Wearing lipstick & $0.639$ & $0.266$ & $0.982$ & $-0.625$\\
    Heavy makeup & $0.404$ & $0.003$ & $0.992$ & $-0.771$ \\
    Arched eyebrows & $0.154$ & $-0.067$ & $0.708$ & $-0.420$\\
    Beard & $0.272$ & $0.158$ & $0.935$ & $0.712$\\
    5 o'clock shadow & $0.180$ & $0.093$ & $0.964$ & $0.683$ \\
    Blurry & $10^{-4}$ & $0.002$ & $-0.027$ & $0.029$\\
    \hline
  \end{tabular}
  }
\end{table}

The results of our proposed scores for various binary features in CelebA classification are presented in Table \ref{tab:celeba}. 
Our analysis reveals that features related to makeup and facial hair are among the most important, which aligns with real-world expectations. These features score far higher than the last feature, ``blurry'', which indicates photo blurriness and is unrelated to gender. Additionally, these important features exhibit much higher local scores for the positive value than for the negative value. This indicates that while a face with makeup is strongly indicative of a female, a face without makeup could belong to either gender with considerable probability. The same interpretation applies to facial hair.

Our scores generally align with CaCE scores, with both methods assigning larger absolute values to important features. CaCE emphasizes the influence of feature values on predictions and indicates the direction of this influence: a negative value suggests a higher likelihood of classification as female, while a positive value suggests a higher likelihood of classification as male. In contrast, our method evaluates the ``usefulness'' of the target feature. For instance, the two features related to facial hair receive relatively lower scores from our method due to the scarcity of positive samples in the dataset, which make up only $14.6\%$ and $10.0\%$, respectively. Since these features are strong indicators only when positive (as reflected by our local scores), assigning them lower importance is justified.

\begin{figure}[htbp]
    \floatconts
    {fig:ce_celeba}
    {\caption{Original images (top row) from CelebA and corresponding counterfactual explanations (bottom row) generated using the method of \cite{augustin2022diffusion}.}}
    {\includegraphics[width=.5\textwidth]{ce_celeba_small.pdf}}
\end{figure}

When applying the diffusion CE, $62.2\%$ of the generated explanation samples successfully flipped the model predictions. Sample pairs of original images and their corresponding counterfactual explanations, where the model's prediction was changed, are shown in Figure \ref{fig:ce_celeba}.
The observed changes are generally quite subtle. This, along with the low rate of prediction flipping, may be because the classifier was not trained for robustness. Additionally, the method appears to prioritize features affecting fewer pixels. For instance, it alters the eyebrows (first column) or lip makeup (third column) but not the facial hair. We believe this is related to the strategy of staying close to the original data point by minimizing the $L_1$ distance, and we anticipate similar results with the $L_2$ distance. In contrast, we believe our method is not limited to robust classifiers and does not favor features affecting smaller regions.


\section{Models}
We denote the batch size as $N$.

\subsection{Ellipse classifier and regression models}

For the ellipse dataset, we used a basic convolutional neural network (CNN)~\cite{lecun1998gradient} classifier consisting of four convolutional layers and two linear layers, chosen for its simplicity. For the regression models on the aspect ratio and size, we adopted the same architecture. The DDPM was trained with a standard U-Net~\cite{ronneberger2015u} backbone.

\begin{table}[htbp]
  \floatconts
  {}
  {\caption{Summary of the ellipse classifier and regression model architectures.}}
  {\begin{tabular}{lc}
    \hline
    Layer Type & Output Size  \\
    \hline
    Input & $(N, 1, 32, 32)$ \\
    CNNBlock & $(N, 32, 16, 16)$ \\
    CNNBlock & $(N, 64, 8, 8)$ \\
    CNNBlock & $(N, 64, 4, 4)$ \\
    CNNBlock & $(N, 64, 2, 2)$ \\
    Flatten & $(N, 256)$\\
    Linear \& ReLU & $(N, 64)$\\ 
    Linear (\& Sigmoid for the classifier) & $(N, 1)$\\
    \hline
  \end{tabular}}
\end{table}

The CNN block is a basic convolutional block as described in Table \ref{tab:cnn}.

\begin{table}[htbp]
  \floatconts
  {tab:cnn}
  {\caption{CNN block used for the ellipse dataset.}}
  {\begin{tabular}{lc}
    \hline
    Layer Type & Kernel Size \\
    \hline
    Input  & - \\
    2D convolution  & $3\times 3$ \\
    ReLU & - \\
    2D max pooling & $2\times 2$\\
    \hline
    \end{tabular}
  } 
\end{table}
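
As a quick consistency check, the output sizes in the architecture summary can be traced by hand. This sketch assumes that each $3\times3$ convolution preserves the spatial size (for example, stride $1$ with padding $1$), which is not stated explicitly; under that assumption only the $2\times2$ max pooling changes the resolution, halving each spatial dimension per block:
\[
32 \;\xrightarrow{\text{block 1}}\; 16 \;\xrightarrow{\text{block 2}}\; 8 \;\xrightarrow{\text{block 3}}\; 4 \;\xrightarrow{\text{block 4}}\; 2,
\qquad
64 \times 2 \times 2 = 256 .
\]
The final product matches the $(N, 256)$ output of the flatten layer.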

\subsection{Brain ROI classifier and regression models}
We employed a 3D variant of a residual network~\cite{solovyev20223d}, designed to mimic the gender classifier we trained on CelebA, for AD classification and hippocampal volume regression. This version replaces the 2D convolution, pooling, and normalization layers with their 3D counterparts, while keeping the kernel sizes, strides, numbers of channels, and batch normalization parameters unchanged. As before, we trained the model from scratch on our dataset.


\subsection{U-Net used for the DDPM on 2D datasets}
We employed a standard U-Net backbone for the DDPM trained on the ellipse and CelebA datasets. Although our metrics did not require a diffusion model for CelebA, we trained one to perform diffusion counterfactual explanation (CE). We used an implementation that is publicly available at \url{https://github.com/lucidrains/denoising-diffusion-pytorch}. The initial convolution dimensions were $32$ for the ellipse data and $64$ for the CelebA data. The down-sampling and up-sampling paths each consist of four blocks, with each block comprising two ResNet blocks and a linear attention module. The total number of learnable parameters is $9.2$ M for the ellipse data and $35.7$ M for CelebA. For further details, please refer to the aforementioned library.


\subsection{The latent diffusion model on brain ROIs}
 Due to the high-dimensional nature of this data, we trained a latent diffusion model~\cite{rombach2022high}, which combines a VAE and a diffusion model in the latent space. We adopted the 3D variant from~\cite{pinaya2023generative}, publicly accessible at \sloppy\url{https://github.com/Project-MONAI/GenerativeModels}. This library provides a 3D variant of the original latent diffusion model specifically designed for biomedical applications.

We used a shallow autoencoder that downsamples spatial dimensions by a factor of $2$, yielding a latent dimension of $1 \times 32 \times 32 \times 32$. The encoder and decoder each include two ResNet blocks with internal channels of $32$ and $64$. The model has $1.2$ M learnable parameters.
For the U-Net, we used $3$ blocks for both the downsampling and upsampling paths, with each block comprising two ResNet blocks. The internal channel numbers are $256$, $512$, and $768$, respectively. The total number of trainable parameters is $424.3$ M.
For detailed information on the architectures, please refer to the aforementioned library.
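
The latent size also implies the amount of compression. The following back-of-the-envelope check assumes a single-channel cubic input whose side length is inferred from the factor-$2$ downsampling rather than stated here:
\[
2 \times 32 = 64 \ \text{voxels per axis},
\qquad
\frac{64^3}{32^3} = 2^3 = 8 ,
\]
that is, under this assumption the diffusion model operates on roughly eight times fewer values than the input volume contains.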


\section{Training setup}
All classifiers and regression models were trained to optimize performance on validation sets randomly split from the training sets. Test-set performance is reported in Table~\ref{tab:performance}.
Balanced accuracy is reported for AD prediction because the classes are imbalanced.
Due to the small sample size, each training sample from the brain ROI data was augmented with ten random 3D rotations with angles $\alpha \sim \mathcal{U}(0,10^\circ)$. The AD classifier was trained with a weighted sampler to counteract the class imbalance.
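
For reference, balanced accuracy for a binary task is commonly defined as the mean of the two per-class recalls. The counts $\mathrm{TP}$, $\mathrm{FN}$, $\mathrm{TN}$, and $\mathrm{FP}$ (true positives, false negatives, true negatives, and false positives) are notation introduced here for illustration only, and we assume the binary case:
\[
\text{Balanced accuracy} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right).
\]
Under this definition, a classifier that always predicts the majority class scores $0.5$ regardless of how imbalanced the classes are.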


\begin{table}[h!]
 \caption{Performance of classifiers and regression models.}\label{tab:performance}
  \floatconts
  {tab:performance_cls}%
  {\centering{(a) Classifiers}}%
  {
  \begin{tabular}{@{}ll@{}}
    \hline
    Task & Accuracy \\
    \hline
    Gender classification on CelebA & $\ \ 0.966$   \\
    Ellipse classification & $\ \ 0.906$ \\
    AD prediction on brain ROI & $\ \ 0.836$  \\
    \hline
  \end{tabular}
  }
  \vspace{10pt}
  \floatconts
  {tab:performance_reg}%
  {\centering{(b) Regression models}}
  {
  \begin{tabular}{@{}lc@{}}
    \hline
    Task & $R^2$ score \\
    \hline
    Ellipse aspect ratio prediction & $0.999$ \\
    Ellipse size prediction & $0.998$ \\
    Hippo. vol. prediction on brain ROI & $0.864$ \\
    \hline
  \end{tabular}
  }
 
\end{table}


\subsection{Data preprocessing}
Brain MRIs were cropped around the hippocampi and augmented with random 3D rotations up to 10 degrees.
We used FreeSurfer segmentations~\cite{fischi2002whole} from the original dataset to locate the hippocampi.
We then adjusted the contrast by rescaling voxel intensities to the range $0$ to $1$, using the $10$th and $90$th percentiles as thresholds.
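
One common way to write such a percentile-based normalization is sketched below. The symbols $q_{10}$ and $q_{90}$ (the $10$th and $90$th intensity percentiles of an image) and the explicit clipping are assumptions made for illustration, not a restatement of the exact preprocessing code:
\[
\tilde{x} = \min\!\left(1,\; \max\!\left(0,\; \frac{x - q_{10}}{q_{90} - q_{10}}\right)\right),
\]
so that intensities at or below $q_{10}$ map to $0$, intensities at or above $q_{90}$ map to $1$, and values in between are rescaled linearly.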

\subsection{Classifiers and regression models}
All classifiers and regression models were trained using the Adam optimizer with a learning rate of $10^{-5}$. $L_2$ regularization with a weight of $1$ was applied to the ellipse size regressor. Training was stopped early when performance on the validation set ceased to improve. None of the classifiers was specifically trained for robustness.

\subsection{Diffusion models}
For all diffusion models, we used a linear time schedule with $\beta_1 = 0.0015$ and $\beta_T = 0.0205$, and a total of $T=1000$ time steps. The U-Net was trained to predict noise until full convergence on the training set, using the Adam optimizer with a learning rate of $10^{-5}$.
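
Written out, a linear schedule with these endpoints takes the following form. We assume the usual DDPM convention of evenly spaced $\beta_t$ for $t = 1, \dots, T$; the exact indexing used by the implementation may differ slightly:
\[
\beta_t = \beta_1 + \frac{t-1}{T-1}\,(\beta_T - \beta_1),
\qquad
\alpha_t = 1 - \beta_t,
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s .
\]
With the values above, consecutive steps differ by $(0.0205 - 0.0015)/999 \approx 1.9 \times 10^{-5}$.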

\subsection{VAE for the latent diffusion}\label{sec:vae_latent}
The VAE used for dimensionality reduction on the brain ROI data was trained with a combination of an $L_1$ reconstruction loss, a KL-divergence loss, and a perceptual loss. The perceptual loss was computed using a SqueezeNet model trained on ImageNet, with $25\%$ of 2D slices randomly selected along different dimensions. The KL-divergence term was weighted at $10^{-7}$ (summed across all latent dimensions), while the perceptual loss was weighted at $10^{-3}$.
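
Schematically, the combined objective can be written as below. The loss symbols are introduced here for illustration, and we assume that the reconstruction term has unit weight, which is not stated explicitly:
\[
\mathcal{L}_{\text{VAE}} = \mathcal{L}_{L_1} + 10^{-7}\,\mathcal{L}_{\text{KL}} + 10^{-3}\,\mathcal{L}_{\text{perc}} .
\]
The very small KL weight means that the latent space is only weakly regularized toward the prior, so reconstruction quality dominates the objective.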

During guided sampling, the gradient is backpropagated through the classifier and then through the decoder to the latent space. Since the VAE decoder is trained with noise infusion, and the latent representation is decoded using this trained decoder, we assume this process enhances the robustness of the AD classifier guidance.

\section{Diffusion sampling setup}
For the noise coefficient, we used $\sigma_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t} $ for all the DDPM sampling methods.
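
To show where this coefficient enters, the standard unguided DDPM ancestral step is sketched below, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. The noise predictor $\epsilon_\theta$ and the standard Gaussian draw $z$ are notation assumed here for illustration; the guided and constrained samplers modify this step rather than use it as is:
\[
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I).
\]
With the convention $\bar{\alpha}_0 = 1$, this choice gives $\sigma_1 = 0$, so no noise is added in the final step.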



\subsection{Constrained sampling (for our metrics)}
We report the coefficient $\rho$ (see Eq.~(13)) used for each feature in Table \ref{tab:rho}.
\begin{table}[htbp]
  \floatconts
  {tab:rho}
  {\caption{$\rho$ used for constrained sampling}}
  {\begin{tabular}{lc}
      \hline
      Feature & $\rho$ \\
      \hline
      Ellipse aspect ratio  & $0.1$ \\
      Ellipse size     & $0.1$\\
      \multirow{2}{*}{Hippocampal volume} & $0.03$ ($t>400$) \\
       & $0.04$ ($t\leq 400$) \\
      Hippocampus & $0.4$ \\
      \hline
    \end{tabular}
  }
\end{table}

For the constrained hippocampus sampling, we dilated the ground-truth hippocampus masks by $2$ pixels to include edge information.

\begin{figure}[htbp]
  \begin{minipage}{0.32\textwidth}
      \floatconts
      {fig:ar_result}
      {\centering (a) Ellipse log aspect ratios}
      {\includegraphics[width=\textwidth]{ar_target_result.pdf}}
  \end{minipage}\hfill
  \begin{minipage}{0.32\textwidth}
      \floatconts
      {fig:sz_result}
      {\centering (b) Ellipse sizes}
      {\includegraphics[width=\textwidth]{size_target_result.pdf}}
  \end{minipage}\hfill
  \begin{minipage}{0.32\textwidth}
      \floatconts
      {fig:vol_result}
      {\centering (c) Hippocampal volumes}
      {\includegraphics[width=\textwidth]{hippo_vol_target_result.pdf}}
  \end{minipage}
\caption{Target and resulting feature values with constrained DDPM sampling.}\label{fig:target_result}
\end{figure}

The target feature values were sampled from the real test data. We generated 200 samples per feature value and removed those whose feature values (as evaluated by the regression model) deviated from the constraint by more than $0.3$ standard deviations. This thresholding was not applied to the hippocampus feature in the brain ROI data due to the absence of a well-defined standard deviation. We present the resulting feature values against the target values in Figure \ref{fig:target_result}, demonstrating that the feature values are well-constrained.
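
The rejection rule can be stated compactly as follows. The notation is introduced here for illustration: $f^{*}$ is the target feature value, $\hat{f}(x)$ is the regression model's estimate for a generated sample $x$, and $s_f$ is the standard deviation of the feature, which we assume is computed over the real data. A sample is kept only if
\[
\bigl|\hat{f}(x) - f^{*}\bigr| \le 0.3\, s_f .
\]
For example, for a feature with $s_f = 1$, a sample generated for the target $f^{*} = 2$ would be retained only if its estimated value lies in $[1.7, 2.3]$.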


\subsection{Diffusion counterfactual explanation}
We used the same diffusion model as for the constrained sampling. The weights for the classifier guidance and for the $L_1$ distance guidance to the original sample were set to $0.1$ and $0.15$, respectively, as specified in~\cite{augustin2022diffusion}. Following that paper, we started sampling from the noisy images at time step $t=\frac{T}{2}=500$.
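
Starting from a noisy image at an intermediate step corresponds to applying the closed-form DDPM forward process to the original image $x_0$. The sketch below uses the standard form, with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$; the symbols $x_0$ and $\epsilon$ are assumed here for illustration:
\[
x_{T/2} = \sqrt{\bar{\alpha}_{T/2}}\; x_0 + \sqrt{1 - \bar{\alpha}_{T/2}}\;\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I).
\]
Because only half of the noise schedule is applied, part of the original image structure is retained, which is consistent with the counterfactuals staying close to their originals.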


\section{Samples from the constrained sampling}

\begin{figure}[htbp]
  \floatconts
  {fig:ellipse_samp_ar}
  {\caption{More samples with constrained aspect ratio: each row has the same target aspect ratio.}}
  {\includegraphics[width=0.9\textwidth]{more_ar_samples.png}}
\end{figure}

Note that while our brain ROI data is 3D, we show only a central slice. As a result, it may be difficult to judge the constrained hippocampal volume from these 2D slices. For quantitative results, please refer to Figure \hyperref[fig:vol_result]{7(c)}.


\begin{figure}[htbp]
  \floatconts
  {fig:ellipse_samp_sz}
  {\caption{More samples with constrained size: each row has the same target size.}}
  {\includegraphics[width=0.9\textwidth]{more_sz_samples.png}}
\end{figure}
\nopagebreak
\begin{figure}[htbp]
  \floatconts
  {}
  {\caption{Samples with constrained hippocampal volume: each row has the same target volume.}}
  {\includegraphics[width=0.9\textwidth]{hippo_vol_samples.png}}
\end{figure}

\begin{figure}[htbp]
  \floatconts
  {fig:more_inpaint}
  {\caption{Samples with constrained hippocampus: each row has the same target hippocampus.}}
  {\includegraphics[width=0.9\textwidth]{more_hippo_inpaint_samples.png}}
\end{figure}

\newpage
\section{Linear Regression Analysis}

Given the analogy between our proposed metrics and the $R^2$ score used in linear regression, a natural question arises: could linear regression be applied to feature attribution? In Figure \ref{fig:linear_reg}, we fit linear regression models to the classifier logits (the outputs before the Sigmoid activation), using the interpretable feature as the input. This allows us to calculate $R^2$ scores, which are $0.968$ for the ellipse aspect ratio and $0.467$ for hippocampal volume. However, these values are not directly comparable to our metrics, as they rely on continuous logits, whereas we use discrete classes. While this approach may seem reasonable for the given examples, we argue that there are several reasons why our method cannot be replaced by such a simple strategy:
\begin{itemize}
    \item The method assumes a linear relationship between the interpretable feature and the classifier output, which is not always the case. The examples presented are not perfectly linear, and one could imagine a more extreme case where one class consists of ellipses with aspect ratios between $2$ and $2.5$, and the other class includes ellipses with aspect ratios either smaller than $2$ or larger than $2.5$. The resulting scatter plot would exhibit a $U$-shaped distribution.
    \item This method does not work well for more complex features like the hippocampal region, which lacks a fixed dimensionality in the raw input space. 
    \item Unlike our method, this approach cannot produce local scores specific to given feature values. Additionally, it is insensitive to decision boundaries.
\end{itemize}
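
For concreteness, the $R^2$ score of such a fit can be written as below. The notation is introduced here for illustration: $f_i$ is the feature value of sample $i$, $z_i$ is the corresponding classifier logit, $\bar{z}$ is the mean logit, and $a$, $b$ are the least-squares coefficients:
\[
R^2 = 1 - \frac{\sum_i (z_i - \hat{z}_i)^2}{\sum_i (z_i - \bar{z})^2},
\qquad
\hat{z}_i = a f_i + b .
\]
In the $U$-shaped example from the first item, the best-fitting line can be nearly flat, so that $\hat{z}_i \approx \bar{z}$ and $R^2$ is close to $0$, even though the feature fully determines the class.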



\begin{figure}[htbp]
  \begin{minipage}{0.47\textwidth}
      \floatconts
      {fig:lin_reg_ar}
      {\centering (a) Ellipse log aspect ratios}
      {\includegraphics[width=\textwidth]{lin_reg_ar.png}}
  \end{minipage}\hfill
  \begin{minipage}{0.47\textwidth}
      \floatconts
      {fig:lin_reg_vol}
      {\centering (b) Hippocampal volumes}
      {\includegraphics[width=\textwidth]{lin_reg_vol.png}}
  \end{minipage}
\caption{Scatter plots of feature values versus classifier logits (blue), with the corresponding fitted linear regression models (orange). }\label{fig:linear_reg}
\end{figure}


%\let\clearpage\relax
\end{document}
