\section{On the Application to Image Explanations}
\label{sec: Experimental Setup}
We apply the proposed Sign Entropy regularization method to obtain consistent explanations for images. To generate the perturbed images used to learn the surrogate model, we use Adaptive-blur from \citep{Bora_2024_CVPR}. We conduct a series of experiments to demonstrate that the proposed approach yields consistent explanations. We evaluate it on two pre-trained image classification models, InceptionV3 \citep{szegedy2016rethinking} and ResNet50 \citep{he2016deep}, both initialized with ImageNet weights, using the Oxford-IIIT Pet dataset \citep{parkhi12a} and the Pascal VOC 2007 dataset \citep{pascal-voc-2007}\footnote{We use a kernel size of (5,5) for Gaussian blur (similar to \cite{Bora_2024_CVPR}). The code for the SOTA methods was obtained from the GitHub repositories: BayLIME \citep{baylimerepo}, LIME and SLICE \citep{Bora_2024_CVPR}.}.


We compare our method against LIME, BayLIME, and SLICE. For BayLIME, we use Grad-CAM as the prior, since this configuration was shown to have superior consistency and fidelity compared to LIME \citep{zhao2021baylime}. We randomly select 50 images from each of the two datasets and analyze both DL models over 20 repeated, distinct runs. We compute the consistency and fidelity scores for each image-model pair and average them across the 20 runs. For our ablation study, presented in \Cref{tab:ablation_components}, we follow settings similar to those outlined in \citep{Bora_2024_CVPR}. We assess the statistical significance of our findings with the Wilcoxon signed-rank test \citep{virtanen2020scipy}, chosen for its non-parametric nature. We reject the null hypothesis at the commonly used p-value threshold of 0.05, and we measure effect size with the Common Language Effect Size (CLES) \citep{mcgraw1992common, vargha2000critique}.
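
For concreteness, the significance test and effect size described above can be sketched as follows. This is a minimal illustration, not our evaluation code: the score lists are made-up placeholder values, the helper name \texttt{cles} is hypothetical, and we assume the probability-of-superiority form of CLES, in which ties count as one half.

```python
from scipy.stats import wilcoxon


def cles(a, b):
    """Probability that a value drawn from `a` exceeds one drawn from `b`.

    Ties contribute one half. This is the probability-of-superiority
    form of the effect size (an assumption of this sketch).
    """
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))


# Placeholder per-image scores (hypothetical values, not results).
ours = [0.91, 0.88, 0.95, 0.79, 0.85, 0.93, 0.90, 0.87]
baseline = [0.80, 0.82, 0.90, 0.75, 0.70, 0.85, 0.77, 0.77]

# Paired, two-sided Wilcoxon signed-rank test on the score differences.
result = wilcoxon(ours, baseline)

ALPHA = 0.05  # threshold for rejecting the null hypothesis
reject_null = result.pvalue < ALPHA
effect = cles(ours, baseline)

print(f"p-value = {result.pvalue:.4f}, reject null: {reject_null}")
print(f"CLES = {effect:.3f}")
```

A CLES of 0.5 indicates no tendency for either method to score higher, while values close to 1 indicate that the first method scores higher in almost every pairwise comparison.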

\subsection{Evaluation Metrics}
\label{sub:evaluation_metrics}
For a fair comparison of the consistency and fidelity of our proposed approach with the SOTA approaches, we use the Combined Consistency Metric (CCM) from \citep{Bora_2024_CVPR}. CCM is defined as:
\begin{equation}
CCM_{M, I}^{xp} = (1-ASFE_{M, I}^{xp}) \cdot ARS_{M, I}^{xp}
\label{eqn:eqn-10}
\end{equation}
\noindent where $ASFE_{M, I}^{xp}$ denotes the Average Sign Flip Entropy of the coefficients and $ARS_{M, I}^{xp}$ denotes the Average Rank Similarity of the superpixels in the explanations for a model $M$ and image $I$. $ASFE_{M, I}^{xp}$ ranges from 0 to 1, with a lower value indicating better consistency, and $ARS_{M, I}^{xp}$ ranges from 0 to 1, with a higher value indicating better consistency. $CCM_{M, I}^{xp}$ therefore lies in $[0,1]$, where 0 denotes low consistency and 1 denotes full consistency in both Sign Entropy and superpixel importance ranks (details are provided in \Cref{sup:definitions} of the supplementary material). Further, the adapted Area Over the Perturbation Curve (AOPC) \citep{Bora_2024_CVPR} and the Insertion and Deletion Area Under the Curve metrics \citep{petsiuk2018rise} are used to measure the fidelity of explanations.
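
As a small worked illustration of the CCM definition, the sketch below combines precomputed ASFE and ARS values into a CCM score. It assumes both inputs are already available as numbers in $[0,1]$; their computation is described in the supplementary and is not reproduced here. The function name \texttt{ccm} and the input values are hypothetical.

```python
def ccm(asfe, ars):
    """Combined Consistency Metric: (1 - ASFE) * ARS.

    Both inputs are assumed to be precomputed and to lie in [0, 1].
    """
    for name, value in (("asfe", asfe), ("ars", ars)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (1.0 - asfe) * ars


# Hypothetical (ASFE, ARS) pairs for three images under one model.
scores = [(0.25, 0.5), (0.0, 1.0), (1.0, 1.0)]

per_image = [ccm(asfe, ars) for asfe, ars in scores]
mean_ccm = sum(per_image) / len(per_image)

print(per_image)
print(mean_ccm)
```

The second pair shows the ideal case (no sign flips and identical ranks, so CCM is 1), and the third shows that maximal sign-flip entropy drives CCM to 0 even when the ranks agree perfectly.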