\section{Experimental Results}\label{sec:eval}

We implemented the \nito algorithm and evaluated the explanations of absence it produces on
three models, each paired with a different publicly available dataset: one of MRI scans and two of CT scans.
Although nothing in our definition depends on the model or the dataset, we are interested in both the variance of
\betagood on real-world data and the effect of different masking values.
The model for the MRI data is a pretrained CNN based on the ResNet50 architecture~\citep{legastelois2023challenges,blake2023mri}. The brain magnetic resonance imaging (MRI) data comes from The Cancer Imaging Archive, as published by~\cite{buda2019association}, and is publicly available on Kaggle\footnote{\url{https://www.kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation}}. The dataset comprises 3,929 slices extracted from 110 scans; each slice either contains a tumor or does not. Because the scans were gathered at five distinct US institutions, instrumentation and acquisition protocols may vary across them.
\begin{table}[ht]
    \centering
    \begin{tabular}{r|r|r|r||r|r|r}
    \toprule\toprule
    \multirow{3}{*}{Datasets} & \multicolumn{6}{c}{Masking Values}\\
 %   \midrule
     & \multicolumn{3}{c||}{0} & \multicolumn{3}{c}{real}\\
    \cmidrule(lr){2-7} 
     & $\beta$ & $\beta_p$ & $\beta_a$ & $\beta$ & $\beta_p$ & $\beta_a$ \\ 
     \midrule\midrule
       Brain  & 0.87 & 0.87 & 0.85 & 0.44 & 0.43 & 0.4  \\
       Lung  & 0.96 & 0.94 & 0.85 & 0.96 & 0.93 & 0.83  \\
       Pancreas  & 0.89 & 0.88 & 0.62 & 0.81 & 0.79 & 0.66  \\
    \bottomrule
    \end{tabular}
    \caption{\betagood of grids over the three datasets, using two different masking values ($0$ and real).}
    \label{tab:results}
\end{table}
The data for the lung and pancreatic cancer datasets was obtained from the Medical Segmentation Decathlon challenge, a publicly available collection designed to be more difficult than many existing publicly available medical datasets~\citep{antonelli2022medical}. From it, 17,657 slices were extracted from 96 CT scans (lung) and 26,719 slices from 420 CT scans (pancreas). From these slices, we chose 4,000 healthy images uniformly at random for evaluation. Both datasets were included in the challenge because of the small size of their tumors. For each of the two datasets, a ResNet18 model was trained. For all three datasets, we created a causal explanation database using \rex. All explanations were saved to an SQL database for efficient querying.
All the models were trained as binary classifiers (tumor or no tumor). These models are not designed to be clinically useful: our goal is to generate explanations of absence and automatically assess their quality. Hence, we did not attempt to optimize the performance of our models or to make them generalize to out-of-distribution data. Computing explanations of absence is not computationally expensive. All experiments were run on an Ubuntu 20.04 server with an Nvidia A40 GPU. With the explanations cached in advance, an individual grid calculation and \betagood evaluation takes less than one second.

\Cref{tab:results} summarizes the results. A masking value of $0$ performs well on all datasets and models ($\beta \geq 0.87$). The real values paint a more mixed picture. Interestingly, the model that performs least well with real values is the one for the brain dataset, where $\beta$ drops from $0.87$ to $0.44$. As this is the
only dataset in true color, this suggests that the brain model is more sensitive to exact pixel values than the models for lung and pancreatic cancer.
Both the lung and pancreas datasets are CT data, treated as pseudo-RGB for the purposes of \rex.

The parameters $\beta$, $\beta_p$, and $\beta_a$ assess the quality of the provided explanations with respect to the model and the dataset. $\beta$ does not take model
confidence into account; it simply gives the fraction of inputs in $\K$ for which superimposing the partial absence grid changes the classification
from $1$ to $0$. By definition $\beta \leq 1$; it falls below one because of the approximation of minimal explanations and the assumption of location independence.
On images where superimposing the grid does not change the classification, 
there must be sufficient information left in the image for the model to still classify it as positive. 
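
The fraction described above can be written out explicitly. As an illustration only, write $f$ for the classifier, $g$ for the partial absence grid, and $x \oplus g$ for the input $x$ with the grid superimposed; this notation is assumed here for readability and is not taken from the formal definition. Then
\[
  \beta \;=\; \frac{\bigl|\{\, x \in \K \;:\; f(x) = 1 \ \mbox{and}\ f(x \oplus g) = 0 \,\}\bigr|}{|\K|} .
\]
Under this reading, the entry $\beta = 0.87$ for the brain dataset with masking value $0$ in \Cref{tab:results} means that superimposing the grid flips the classification for roughly $87\%$ of the inputs in $\K$.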

Taking model confidence into account has a significant effect for some of the models.
While $\beta_p$ is generally similar to $\beta$ on these datasets, $\beta_a$ indicates that the pancreas model is less confident about its predictions,
as can be seen from its relatively low $\beta_a$ ($0.62$ and $0.66$, i.e., $\approx 0.64$ on average), compared to $\beta$ for both masking values.
$\beta_a$ may be of more use for models returning low-confidence classifications.