%\section{Results and Discussion}
\section{Consistency of BELIEF}
\label{sec:results-discussion}


\begin{figure}[htp]
\centering
\includegraphics[width=0.45\textwidth]{figures/ecdf_ccm_belief_slice_lime_baylime_oxpets.pdf}
\caption{ECDF plot of CCM Scores for BELIEF, LIME, BayLIME and SLICE (higher score is better)}
\label{fig:ecdf_ccm_oxpets}
\end{figure}


The Empirical Cumulative Distribution Function (ECDF) plots of the CCM scores for both models on the Oxford-IIIT Pets dataset are shown in \Cref{fig:ecdf_ccm_oxpets} (refer to \Cref{fig:ecdf_ccm} and \Cref{fig:combined_score} in the supplementary material for exhaustive ECDF and density plots). BayLIME and LIME have much lower CCM scores than BELIEF and SLICE. These results empirically indicate that, in terms of consistency, our proposed approach performs on par with SLICE without requiring an additional feature selection step.
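
For reference, the ECDF of a set of $n$ scores $s_1, \dots, s_n$ is the standard step function below; the symbols $s_i$, $n$ and $t$ are notation introduced here purely for illustration:
\[
\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ s_i \le t \right].
\]
Since $\hat{F}_n(t)$ is the fraction of scores at or below $t$, a method whose curve lies further to the right (i.e., lower at any given $t$) has a larger share of high scores.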


Further, we conducted Wilcoxon signed-rank tests to ascertain that the higher CCM scores of BELIEF, as compared to LIME and BayLIME, are statistically significant. The p-values were low (in the range of 8.9e-16 to 2.2e-11), the test statistics were high (in the range of 1227 to 1275), and the effect sizes were large (in the range of 0.96 to 1); refer to \Cref{tab:wilcoxon_results_ccm} in the supplementary material for test details. The low p-values and high test statistics provide strong statistical evidence to reject the null hypothesis. Further, the large effect sizes indicate that the higher CCM scores of BELIEF explanations are not only statistically significant but also practically meaningful.
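
To make the scale of the test statistic easier to read, the following is a sketch of the signed-rank statistic under its usual definition. It assumes $n$ image pairs with non-zero differences $d_i$ between the CCM score of BELIEF and that of the baseline; the symbols $d_i$, $R_i$ and $n$ are notation introduced here, and which variant of the statistic is reported is also an assumption:
\[
W^{+} = \sum_{i \,:\, d_i > 0} R_i, \qquad 0 \le W^{+} \le \frac{n(n+1)}{2},
\]
where $R_i$ is the rank of $|d_i|$ among all absolute differences. The upper bound is attained only when every difference is positive. For instance, if $n = 50$ (an assumed value, used here only because $50 \cdot 51 / 2 = 1275$ coincides with the upper end of the reported range), a statistic of 1275 would mean that BELIEF scored higher on every image.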


\begin{table}[htp]
\centering
\caption{Ablation settings with BELIEF and  SLICE variants}
\label{tab:ablation_components}
\resizebox{0.35\textwidth}{!}{
\begin{tabular}{lcc}
\hline
\hline
{Method} & {Feature} & {Adaptive-Blur}\\ 
& Elimination & \\ \hline
SLICE\_blur & \xmark &  \cmark\\
SLICE\_FE & \cmark &  \xmark\\ 
SLICE & \cmark & \cmark\\ 
BELIEF & \cmark & \cmark\\
BELIEF\_FE & \cmark & \xmark\\
\hline
\hline
\end{tabular}
}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[width=0.41\textwidth]{figures/ecdf_ccm_belief_belief-fe_slice_sliceblur_sliceblurfe_oxpets.pdf}
\caption{ECDF plot of CCM Scores for BELIEF, BELIEF\_FE, SLICE\_blur, SLICE\_FE and  SLICE (higher is better)}
\label{fig:ablation_study_oxpets}
\end{figure}

\subsection{Ablation Study for BELIEF}
\label{subsub:ablation_study}
In our ablation study, we evaluate BELIEF in settings similar to those of SLICE, i.e., with (BELIEF) and without (BELIEF\_FE) adaptive blur, as noted in \Cref{tab:ablation_components}. As our proposed approach enforces sparsity using Sign Entropy regularization, BELIEF has no counterpart to SLICE\_blur in the ablation study. The ECDF plots of the CCM scores of all five variants of BELIEF and SLICE on the Oxford-IIIT Pets dataset are presented in \Cref{fig:ablation_study_oxpets} (refer to \Cref{fig:ablation_study} in the supplementary material for exhaustive plots). BELIEF and SLICE, whose ECDF curves are nearly identical, performed the best, followed by SLICE\_blur, while BELIEF\_FE and SLICE\_FE performed the worst. This supports our idea that using Sign Entropy as a regularization technique in the Bayesian paradigm can achieve the same consistency as SLICE without the need for an additional feature selection step.

Similarly, BELIEF\_FE and SLICE\_FE have higher CCM scores than LIME but lower scores than BELIEF, SLICE and SLICE\_blur, as seen in \Cref{fig:ecdf_ccm_oxpets} and \Cref{fig:ablation_study_oxpets} (refer to the exhaustive ECDF plots in \Cref{fig:ablation_study} and the density plots in \Cref{fig:ablation_study_density} of the supplementary material). Without Adaptive-Blur, the perturbed images used to build the surrogate model differ significantly from the original image, making it difficult for BELIEF\_FE and SLICE\_FE to estimate the Sign Entropy of the superpixels/segments. However, when Adaptive-Blur is combined with Sign Entropy regularization (or with Sign Entropy based feature selection in SLICE), the estimation of Sign Entropy is more accurate, leading to the proper elimination of inconsistent features/superpixels. We further performed Wilcoxon signed-rank tests to ascertain the statistical significance of our claims. The low p-values (8.9e-16 to 4.7e-3) and high test statistics (904 to 1275) provide strong statistical evidence that the CCM scores of BELIEF are higher than those of LIME, BayLIME, SLICE\_blur and SLICE\_FE. Further, the effect sizes (close to 1 in most cases) indicate that the observed differences are practically meaningful as well as statistically significant. The details of the tests are in \Cref{tab:wilcoxon_results_ccm_ablation} of the supplementary material.

\section{Fidelity Evaluation of BELIEF Explanations}
\label{sub:fidelity_evaluation}

\subsection{Area Over the Perturbation Curve (AOPC)}

\begin{figure}[htp]
\centering
\begin{subfigure}[t]{0.45\textwidth}
  \centering
  \includegraphics[width=\textwidth]{figures/aopc_del_oxpets.png}
  \caption{ECDF plots of AOPC deletion scores}
  \label{fig:aopc_del}
\end{subfigure}%
\hfill
\begin{subfigure}[t]{0.45\textwidth}
  \centering
  \includegraphics[width=\textwidth]{figures/aopc_ins_oxpets1.png}
  \caption{ECDF plots of AOPC insertion scores}
  \label{fig:aopc_ins}
\end{subfigure}
\caption{ECDF plots of AOPC (Higher AOPC indicates higher fidelity)}
\label{fig:ecdf-plots-aopc-ins-del_oxpets}
\end{figure}

\begin{figure}[htp]
\centering
\begin{subfigure}[t]{0.45\textwidth}
  \centering
  \includegraphics[width=\textwidth]{figures/del_oxpets.png}
  \caption{ECDF plots of AUC deletion scores (Lower is better)}
  \label{fig:auc_del_oxpets}
\end{subfigure}%
\hfill
\begin{subfigure}[t]{0.45\textwidth}
  \centering
  \includegraphics[width=\textwidth]{figures/ins_oxpets1.png}
  \caption{ECDF plots of AUC insertion scores (Higher is better)}
  \label{fig:auc_ins_oxpets}
\end{subfigure}
\caption{ECDF plots of AUC scores for Oxford-IIIT Pets Dataset for Inception V3 and ResNet50}
\label{fig:ecdf-plots-auc-ins-del_oxpets}
\end{figure}



The AOPC scores of BELIEF were higher than those of LIME and BayLIME, as seen in \Cref{fig:ecdf-plots-aopc-ins-del_oxpets} for the Oxford-IIIT Pets dataset (refer to \Cref{fig:aopc_del} and \Cref{fig:aopc_ins} in the supplementary material for both datasets). The AOPC scores are low in absolute terms because they are based on the difference between the output probability of the unperturbed image and that of the perturbed images. We performed Wilcoxon signed-rank tests along with effect size calculations to ascertain that the higher AOPC scores of BELIEF explanations, as compared to those of LIME and BayLIME, are statistically significant and practically meaningful. In our tests, as shown in \Cref{tab:wilcoxon_results_aopc_oxpets} for the Oxford-IIIT Pets dataset, the p-values were low (at most 2.6e-05), the test statistics were high (1040 to 1231), and the effect sizes were large (CLES of 0.75 to 0.89). The details of the test results for both datasets are provided in \Cref{tab:wilcoxon_results_aopc} of the supplementary material. These results provide strong statistical evidence that the AOPC scores of BELIEF are significantly higher than those of LIME and BayLIME and that the difference is practically meaningful.
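
For orientation, one common formulation of the AOPC from the literature is sketched below. The exact variant (perturbation order, number of steps, normalization) is not restated in this section, so this form should be read as an assumption; the symbols $f$, $x^{(k)}$ and $K$ are notation introduced here:
\[
\mathrm{AOPC} = \frac{1}{K+1} \sum_{k=0}^{K} \left( f\big(x^{(0)}\big) - f\big(x^{(k)}\big) \right),
\]
where $f$ is the predicted probability of the explained class, $x^{(0)}$ is the unperturbed image, and $x^{(k)}$ is the image after $k$ perturbation steps. Each term is a difference of two probabilities, so the average lies in $[-1, 1]$ and is typically well below 1, which is consistent with the small absolute values observed.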

\begin{table}[h]
\centering
\caption{Wilcoxon signed-rank test results for the comparison of LIME, BayLIME, and BELIEF. For a given pair $(x,y)$, the null hypothesis $H_{0}$ was ``The median of the differences ($metric\ score(x) - metric\ score(y)$) is equal to zero,'' and the alternative hypothesis $H_{a}$ was ``The median of the differences ($metric\ score(x) - metric\ score(y)$) is greater than zero''. B, L, and Ba denote BELIEF, LIME, and BayLIME, respectively. D:M denotes Dataset:Model; O refers to the Oxford-IIIT Pets dataset, R denotes ResNet50, and I denotes Inception V3. W denotes the test statistic and CLES denotes the Common Language Effect Size.}
\label{tab:wilcoxon_results_aopc_oxpets}
\resizebox{0.3\textwidth}{!}{
\begin{tabular}{lllll}
\hline
\hline
\textbf{Test} & \textbf{D:M} & \textbf{W} & \textbf{p-value} & \textbf{CLES} \\
\hline
\multicolumn{5}{c}{AOPC Insertion}\\
\hline
B, L & O:I & 1229 & 1.7e-11 & 0.892 \\ 
B,Ba & O:I & 1188 & 1.4e-09 & 0.886 \\ 
B,L & O:R & 1040 & 2.6e-05 & 0.756 \\ 
B,Ba & O:R & 1057 & 1.1e-05 & 0.753 \\ 
\hline
\multicolumn{5}{c}{AOPC Deletion}\\
\hline
B,L & O:I & 1231 & 1.3e-11 & 0.889 \\ 
B,Ba & O:I & 1184 & 2.0e-09 & 0.879 \\ 
B,L & O:R & 1040 & 2.6e-05 & 0.758 \\ 
B,Ba & O:R & 1054 & 1.3e-05 & 0.753 \\ 
\hline
\multicolumn{5}{c}{AUC Insertion}\\
\hline
B,L & O:I & 1230 & 1.5e-11 & 0.898 \\ 
B,Ba & O:I & 1190 & 1.2e-09 & 0.885 \\ 
B,L & O:R & 1051 & 1.5e-05 & 0.767 \\ 
B,Ba & O:R & 1076 & 4.0e-06 & 0.766 \\ 
\hline
\multicolumn{5}{c}{AUC Deletion}\\
\hline
L,B & O:I & 1230 & 1.5e-11 & 0.894 \\ 
Ba,B & O:I & 1186 & 1.7e-09 & 0.883 \\ 
L,B & O:R & 1056 & 1.2e-05 & 0.767 \\ 
Ba,B & O:R & 1068 & 6.1e-06 & 0.764 \\ 
\hline
\hline
\end{tabular}
}
\end{table}


\subsection{Deletion and Insertion Game}
We additionally analyze Insertion and Deletion AUC for fidelity evaluation \citep{petsiuk2018rise} for BELIEF, LIME, and BayLIME. A higher area under the curve (AUC) of the insertion graph indicates higher fidelity of explanations. Conversely, in the deletion procedure, a lower AUC of the deletion graph indicates higher fidelity. 
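
As an illustration of how such an AUC can be computed, suppose the class probability $p_k$ is recorded after inserting (or deleting) a fraction $k/K$ of the image. This is a sketch under the assumption of $K$ equally spaced steps and the trapezoidal rule; $p_k$ and $K$ are notation introduced here:
\[
\mathrm{AUC} \approx \frac{1}{K} \sum_{k=1}^{K} \frac{p_{k-1} + p_k}{2}.
\]
An explanation that ranks the truly important regions first makes $p_k$ rise quickly under insertion (large AUC) and fall quickly under deletion (small AUC).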

We present the ECDF plots of the AUC insertion and AUC deletion scores on the Oxford-IIIT Pets dataset for both models in \Cref{fig:ecdf-plots-auc-ins-del_oxpets}. The deletion panel of \Cref{fig:ecdf-plots-auc-ins-del_oxpets} shows that the deletion AUCs of BELIEF explanations were lower than those of LIME and BayLIME explanations (refer to \Cref{fig:del_auc} in the supplementary material for ECDF plots for both datasets). We performed Wilcoxon signed-rank tests on the AUCs obtained for all three methods on the Oxford-IIIT Pets dataset to confirm this observation (refer to \Cref{tab:wilcoxon_results_aopc_oxpets} for test details). Extremely low p-values and high test statistics in all scenarios provide strong statistical evidence to reject the null hypothesis. This confirms that the AUC deletion scores of LIME and BayLIME were significantly higher than those of BELIEF. Further, the large effect sizes (0.76 to 0.89) indicate that this difference is practically meaningful.

Similarly, the higher AUC insertion scores of BELIEF can be seen in the insertion panel of \Cref{fig:ecdf-plots-auc-ins-del_oxpets} (refer to \Cref{fig:ins_auc} in the supplementary material for ECDF plots for both datasets), with the details of the statistical tests in \Cref{tab:wilcoxon_results_aopc_oxpets}. The results from our tests provide strong statistical evidence that the explanations of BELIEF are significantly superior to those of LIME and BayLIME in terms of fidelity and on par with those of SLICE. The detailed results on both datasets can be found in the supplementary material (\Cref{tab:wilcoxon_results_auc}).

\section{Comparison of BELIEF and SLICE}
%We compare BELIEF with SLICE to benchmark in terms of consistency, fidelity and speed in this section

\subsection{Consistency Comparison}
\label{subsub:consistency_comparison}
BELIEF and SLICE have almost the same distribution of CCM scores on the Oxford-IIIT Pets dataset for both models, as shown in \Cref{fig:ecdf_ccm_oxpets} (refer to \Cref{fig:ecdf_ccm} in the supplementary material for ECDF plots of both datasets). To check whether there is any significant difference in their CCM scores, we conducted Wilcoxon signed-rank tests, as shown in \Cref{tab:wilcoxon_results_belief_vs_slice_ccm_oxpets} for the Oxford-IIIT Pets dataset. We fail to reject the null hypothesis (``The median of the differences ($CCM\ score(\text{BELIEF}) - CCM\ score(\text{SLICE})$) is equal to zero'') because the p-values are much larger than the commonly accepted threshold of 0.05, indicating insufficient statistical evidence to conclude that the CCM scores of BELIEF and SLICE are different.

\begin{table}[h]
\centering
\caption{Wilcoxon signed-rank test results comparing the CCM scores of BELIEF (B) and SLICE (S) with a two-sided alternative hypothesis. In each test, the null hypothesis $H_{0}$ was ``The median of the differences ($CCM\ score(\text{BELIEF}) - CCM\ score(\text{SLICE})$) is equal to zero'' and the alternative hypothesis $H_{a}$ was ``The median of the differences ($CCM\ score(\text{BELIEF}) - CCM\ score(\text{SLICE})$) is not equal to zero''. D:M denotes Dataset:Model, where O refers to the Oxford-IIIT Pets and P refers to the PASCAL VOC datasets. R denotes ResNet50 and I denotes Inception V3. W denotes the test statistic and CLES denotes the Common Language Effect Size.}
\label{tab:wilcoxon_results_belief_vs_slice_ccm_oxpets}
\begin{adjustbox}{width=0.35\textwidth}
\begin{tabular}{lllll}
\hline
\hline
\textbf{Test} & \textbf{D:M} & \textbf{W} & \textbf{p-value} & \textbf{CLES} \\
\hline
B, S & O:I & 501 & 0.19 & 0.392 \\
B, S & O:R & 422 & 0.37 & 0.432 \\
B, S & P:I & 552 & 0.42 & 0.392 \\
B, S & P:R & 493 & 0.17 & 0.428 \\
\hline
\hline
\end{tabular}
\end{adjustbox}
\end{table}


\subsection{Fidelity Comparison}
\label{subsub:fidelity_comparison}
We further see that the distributions of the fidelity scores are similar for BELIEF and SLICE, as shown in \Cref{fig:ecdf-plots-aopc-ins-del_oxpets} and \Cref{fig:ecdf-plots-auc-ins-del_oxpets} for the Oxford-IIIT Pets dataset with both the Inception V3 and ResNet50 models (refer to \Cref{fig:ecdf-plots-aopc-ins-del} and \Cref{fig:ecdf-plots-ins-del} in the supplementary material for ECDF plots for both datasets). The p-values in \Cref{tab:wilcoxon_results_belief_slice_fidelity_oxpets} are much greater than the commonly accepted threshold of 0.05, so we fail to reject the null hypothesis. Hence, we conclude that there is not enough statistical evidence to claim that the fidelity scores of BELIEF and SLICE are different (refer to \Cref{tab:wilcoxon_results_belief_slice_fidelity} in the supplementary material for test details on both datasets).

\begin{table}[htp]
\centering
\caption{Wilcoxon signed-rank test results for the comparison of BELIEF (B) and SLICE (S). For each metric, the test B,S uses the null hypothesis $H_{0}$ ``The median of the differences ($metric\ score(\text{BELIEF}) - metric\ score(\text{SLICE})$) is equal to zero'' and the alternative hypothesis $H_{a}$ ``The median of the differences ($metric\ score(\text{BELIEF}) - metric\ score(\text{SLICE})$) is not equal to zero''. AOPC and AUC are the metrics. D:M denotes Dataset:Model; O refers to the Oxford-IIIT Pets and P refers to the PASCAL VOC datasets. R denotes ResNet50 and I denotes Inception V3. W denotes the test statistic and CLES denotes the Common Language Effect Size.}
\label{tab:wilcoxon_results_belief_slice_fidelity_oxpets}
\resizebox{0.30\textwidth}{!}{
\begin{tabular}{lllll}
\hline
\hline
\textbf{Test} & \textbf{D:M} & \textbf{W} & \textbf{p-value} & \textbf{CLES} \\
\hline
\multicolumn{5}{c}{AOPC Insertion}\\
\hline
B,S & O:I & 590 & .65 & 0.538 \\ 
B,S & O:R & 557 & .44 & 0.535 \\ 
\hline
\multicolumn{5}{c}{AOPC Deletion}\\
\hline
B,S & O:I & 589 & .65 & 0.537 \\ 
B,S & O:R & 567 & .50 & 0.528 \\ 
\hline
\multicolumn{5}{c}{AUC Insertion}\\
\hline
B,S & O:I & 589 & .65 & 0.535 \\ 
B,S & O:R & 546 & .38 & 0.546 \\ 
\hline
\multicolumn{5}{c}{AUC Deletion}\\
\hline
B,S & O:I & 591 & .66 & 0.462 \\ 
B,S & O:R & 553 & .42 & 0.460 \\ 
\hline
\hline
\end{tabular}
}
\end{table}

\subsection{Runtime Comparison}

The main difference between BELIEF and SLICE is that BELIEF uses our proposed Sign Entropy regularization, whereas SLICE uses frequentist Ridge Regression with bootstrapping to eliminate features with high Sign Entropy, which makes it slow. Further, the most time-consuming component is running the predict function on the perturbed sample images generated around the IE. A larger sample size requires more calls to predict, thus increasing the overall execution time.

We therefore analyze the computational advantage of BELIEF over SLICE. To this end, we ran SLICE on 100 random images from the Oxford-IIIT Pets and PASCAL VOC datasets, recorded the number of calls to the predict function and the execution time for each image, and calculated their Pearson correlation. The Pearson correlation for SLICE was $0.9959$ with ResNet50 and $0.9953$ with Inception V3, supporting our assumption that the sample size directly drives the execution time.
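
For completeness, with $c_i$ denoting the number of predict calls and $t_i$ the execution time for image $i$ (notation introduced here for illustration), the Pearson correlation is
\[
r = \frac{\sum_{i} (c_i - \bar{c})(t_i - \bar{t})}{\sqrt{\sum_{i} (c_i - \bar{c})^2}\, \sqrt{\sum_{i} (t_i - \bar{t})^2}},
\]
where $\bar{c}$ and $\bar{t}$ are the sample means. A value of $r$ close to 1 indicates a nearly linear relationship; with the reported values, $r^2 \approx 0.99$, i.e., about 99\% of the variance in execution time is linearly associated with the number of predict calls.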

Further, we fixed the sample size for BELIEF at 500 for all our experiments and achieved consistency and fidelity comparable to those of SLICE (see \Cref{subsub:consistency_comparison} and \Cref{subsub:fidelity_comparison}), which used $\approx 2500$ samples for ResNet50 and $\approx 3000$ samples for Inception V3. We present the sample sizes used by SLICE for all the images for the ResNet50 and Inception V3 models in \Cref{fig:runtime}. The median number of samples required for SLICE to stabilize the explanations was $2500$ for ResNet50 and $3000$ for Inception V3, i.e., roughly $5\times$ and $6\times$ the sample size of BELIEF, respectively. Based on this distribution, we employed Kernel Density Estimation (KDE) to estimate the probability that SLICE requires a sample size of 500 or less, using Scott's method \citep{scott2015multivariate} to calculate the bandwidth. The estimated probability was $5.058 \times 10^{-3}$ for ResNet50 and $1.240 \times 10^{-6}$ for Inception V3. BELIEF was therefore able to stabilize LIME explanations with a much smaller sample size and a lower execution time, as shown in \Cref{tab:runtime}. While the running time of BELIEF is comparable to that of LIME and BayLIME, its consistency is comparable to that of SLICE.
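
The KDE-based probability reported above can be sketched as follows, assuming a Gaussian kernel; the kernel choice and the symbols $m_i$, $n$, $h$ and $\hat{\sigma}$ are assumptions introduced here for illustration. With observed SLICE sample sizes $m_1, \dots, m_n$, the density estimate and the resulting probability are
\[
\hat{f}(m) = \frac{1}{n h} \sum_{i=1}^{n} \phi\!\left( \frac{m - m_i}{h} \right), \qquad
\hat{P}(M \le 500) = \frac{1}{n} \sum_{i=1}^{n} \Phi\!\left( \frac{500 - m_i}{h} \right),
\]
where $\phi$ and $\Phi$ are the standard normal density and distribution function. For one-dimensional data, Scott's rule sets the bandwidth as $h \propto \hat{\sigma}\, n^{-1/5}$, with $\hat{\sigma}$ the sample standard deviation and a proportionality constant close to 1 (implementations differ on the exact constant).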
\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{figures/runtime.pdf}
\caption{Distribution of sample sizes of  SLICE for 100 random images with ResNet50 and Inception V3 models with dotted lines of corresponding colors denoting the respective median values. The sample size of BELIEF is denoted using the blue dotted line at 500. BELIEF uses a much smaller sample size as compared to  SLICE to achieve comparable Consistency and Fidelity of Explanations.}
\label{fig:runtime}
\end{figure}


\begin{table}[htp]
%\renewcommand{\arraystretch}{1.1}
\centering
\caption{Median running time in seconds per image (lower is better) and median CCM scores (higher is better) of BELIEF, SLICE, BayLIME and LIME for the Inception V3 and ResNet50 models. The values were obtained by running the four methods on 100 randomly sampled images from the Oxford-IIIT Pets and PASCAL VOC 2007 datasets for both models, and the medians were computed by aggregating values from both datasets.}
\label{tab:runtime}
\resizebox{0.48\textwidth}{!}{
\begin{tabular}{lcccc}
\hline
\textbf{Method} & \textbf{Runtime $\downarrow$}& \textbf{CCM $\uparrow$} & \textbf{Runtime $\downarrow$}& \textbf{CCM $\uparrow$} \\
\cline{2-5} &\multicolumn{2}{c}{Inception V3} & \multicolumn{2}{c}{ResNet50}\\
\hline
LIME & 5.06 &  0.232 & 3.43 &  0.363  \\ 
BayLIME & 5.04 &  0.312 & 3.38 & 0.501  \\ 
SLICE  & {\color{red} 50.53} & 0.999 & {\color{red} 30.32} & 0.999  \\ 
BELIEF & 5.04 & 0.998 & 3.39 & 0.999  \\
\hline
\end{tabular}
}
\end{table}

