\section{Related work}
    \label{suppl:related-work}
    Explainable AI (XAI) methods for image analysis can be categorized into attribution and non-attribution-based approaches \cite{he2022survey, hossain2023explainable}. 
    Attribution-based methods aim to explain \enquote{where} important features are located by generating saliency maps that assign importance scores to pixels or regions. In contrast, non-attribution methods focus on explaining \enquote{why} a decision was made, typically providing global explanations using approaches such as prototype- or concept-based explanations \cite{chen2019looks, koh2020concept}.
    Attribution-based methods usually provide post-hoc local explanations for black-box models after training. In contrast, non-attribution-based methods are generally inherently interpretable by design \cite{chen2019looks, brendel2018bagnets, koh2020concept, bohle2022b}, embedding transparency within their architecture.
    However, inherently interpretable models are typically model-specific and often face reduced performance, reflecting a persistent trade-off between interpretability and predictive performance \cite{rudin2019stop}. To bridge this gap, we propose SoftCAM, a protocol that makes traditional black-box CNNs inherently interpretable. Our approach builds on and generalizes prior work.
    
    \citet{aubreville2019transferability} proposed a dual-branch architecture in which the first branch performed classification using a black-box CNN, while the second branch generated post-hoc explanations, requiring two separate forward passes. In the explanation branch, the global average pooling (GAP) layer is removed and the linear classifier is replaced with convolutional layers that share weights with the main classifier during inference, producing class-specific activation maps.
    In contrast, SoftCAM integrates both prediction and explanation within a single, end-to-end forward pass, eliminating the need for redundant computation and weight sharing.
    Similarly, \citet{donteu2023sparse} leveraged a convolutional classifier to generate explicit class-evidence maps and applied a sparsity constraint to enhance interpretability in a self-explainable Bag-of-Local-Features model (BagNet) \cite{brendel2018bagnets}. SoftCAM extends and generalizes this concept to standard black-box CNNs, broadening its applicability to various medical imaging tasks and modalities. 

\section{Datasets}
    \label{suppl:dataset}
    We evaluated our approach on three publicly available medical imaging datasets spanning three different modalities: the Kaggle Diabetic Retinopathy (DR) dataset \cite{kaggle_dr_detection}, the Retinal OCT dataset \cite{kermany2018identifying}, and the RSNA Chest X-ray (CXR) dataset \cite{rsna_dataset}.
    \begin{itemize}
        \item \textbf{Kaggle DR Dataset.} This dataset comprises $88,702$ high-resolution retinal fundus images labeled for DR severity on a 5-point scale from $0$ (No DR) to $4$ (Proliferative DR). After applying an automated quality filtering pipeline using an ensemble of EfficientNet models trained on the ISBI2020\footnote{\url{https://isbi.deepdr.org/challenge2.html}} challenge dataset, we retained $45,923$ images from $28,984$ subjects. The resulting class distribution was $73\%$, $15\%$, $8\%$, $3\%$, and $1\%$ for classes 0--4, respectively.
        For binary classification (early DR detection), we grouped class \{0\} vs. \{1,2,3,4\}, yielding an imbalance of $73\%$ vs. $27\%$. Additionally, lesion annotations for $65$ images were obtained from  \citet{djoumessi2024inherently} for evaluating the model's explanations at localizing DR-related lesions.
        \item \textbf{Retinal OCT Dataset.} This dataset consists of $108,315$ B-scans categorized into four classes: Drusen, Diabetic macular edema (DME), Choroidal neovascularization (CNV), and Normal. A separate test set of $1,000$ B-scans is provided. Following \citet{djoumessi2024actually}, we excluded low-resolution scans (width $\le$ 496).
        As preliminary experiments showed that using the full dataset did not significantly improve performance, we subsampled the training set (by randomly removing half of the healthy images following \citet{djoumessi2024actually}) to $34,962$ scans ($8,616$ Drusen, $26,346$ Normal) for binary classification (Drusen vs. Normal), preserving the original class imbalance ($73\%$ vs $27\%$). Additionally, we used $40$ drusen-annotated B-scans from \citet{djoumessi2024actually} to evaluate the model's explanations at localizing drusen lesions.  
        For the multi-class classification task, the training set was randomly reduced to $17,200$ images while maintaining the original class distribution: $45\%$ Normal, $34\%$ CNV, $10\%$ DME, and $9\%$ Drusen. 
        \item \textbf{RSNA Chest X-ray Dataset.} This dataset includes $30,227$  frontal-view chest radiographs labeled as ``Normal'', ``No Opacity/Not Normal'', and ``Opacity'' (indicative of pneumonia). Pneumonia cases come with bounding box annotations, which facilitate the evaluation of the model’s explanations. For our binary classification task, we selected images labeled as either ``Normal'' or ``Opacity'', resulting in $14,863$ images with a $60\%$ vs. $40\%$ class distribution.
    \end{itemize}

    Each dataset was split into training ($75\%$), validation ($10\%$), and test ($15\%$) sets, except for the Retinal OCT dataset, which followed an $80\%\text{-}20\%$ training-validation split, due to its predefined test set ($250$ images per class). All training splits used in our experiments are provided in CSV format and publicly available via the project’s GitHub repository\footref{github}. 

\section{Implementation details}
    \label{suppl:training_setup}
    \subsection{Baseline models}
        The effectiveness of our method was evaluated using two widely adopted black-box CNN architectures: ResNet-50 and VGG-16. These models were chosen because they differ in key architectural properties, such as depth, theoretical receptive field size, and classification head design, which allows for a broad assessment of our method’s generalizability. 
        In both models, the standard classification head was systematically replaced with our proposed convolution-based evidence map layer to enable inherent interpretability. 
        
        For ResNet-50, we removed the global average pooling layer and final fully connected layer (FCL), substituting them with a class evidence layer consisting of $C$ convolutional filters ($1 \times 1$, stride 1), where $C$ is the number of output classes. This layer directly produces class-specific evidence maps (Sec.\,\ref{extendingCAM}).
        For VGG-16, whose classifier head consists of several fully connected layers, each FCL was replaced by an equivalent $1 \times 1$ convolutional layer. Specifically, an FCL of size $b_1 \times b_2$ was transformed into a convolutional layer of size $b_1 \times b_2 \times 1 \times 1$, preserving the original parameter count and model capacity. 
        These architectural changes maintain model complexity and capacity while introducing interpretability directly into the classification mechanism.
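
        To make the equivalence concrete, the following sketch uses notation introduced here for illustration only ($\mathbf{F}$, $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{E}$ are assumptions and may differ from the notation of Sec.\,\ref{extendingCAM}). Let $\mathbf{F} \in \mathbb{R}^{K \times h \times w}$ be the final feature maps, and let $\mathbf{W} \in \mathbb{R}^{C \times K}$ and $\mathbf{b} \in \mathbb{R}^{C}$ be the weights and biases of a linear classifier. Applying them as a $1 \times 1$ convolution gives, at every spatial location $(x,y)$,
        \begin{equation*}
            E_c(x,y) = \sum_{k=1}^{K} W_{c,k}\, F_k(x,y) + b_c ,
        \end{equation*}
        which uses $CK + C$ parameters, exactly as many as the linear layer. If the evidence maps were aggregated by plain spatial averaging (a simplifying assumption made only for this sketch), linearity would give
        \begin{equation*}
            \frac{1}{hw} \sum_{x,y} E_c(x,y) = \sum_{k=1}^{K} W_{c,k} \Big( \frac{1}{hw} \sum_{x,y} F_k(x,y) \Big) + b_c ,
        \end{equation*}
        that is, the logit of a GAP-plus-linear head. Under this assumption, the class evidence layer exposes the spatial terms of a sum that the standard head computes only in aggregate.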

    \subsection{Data preprocessing and augmentation}
        Fundus images were preprocessed by cropping them to a square shape using a circle-fitting method as described in \cite{Mueller_fundus_circle_cropping}. All datasets were then resized to $512 \times 512$ pixels, except for the retinal OCT dataset, which was resized to $496 \times 496$ to better match its original lower resolution. Image intensities were normalized using the mean and standard deviation computed from the respective training sets. 
        
        During training, standard transformations were applied across all datasets. These included flipping, rotation, random cropping, and translation, each applied with a fixed probability. 
        For the Kaggle dataset, which contains color fundus images, additional color augmentations were introduced to improve generalization.  

    \subsection{Training setup}
        All models were sourced from Torchvision and initialized with pretrained weights from ImageNet. They were subsequently fine-tuned on each dataset using a consistent training setup\footnote{\label{github} The code is available at \url{https://github.com/kdjoumessi/SoftCAM}}. 
        Following \citet{djoumessi2024actually} and \citet{donteu2023sparse}, we employed the cross-entropy loss function and optimized model parameters using stochastic gradient descent with Nesterov momentum (momentum factor of $0.9$). 
        The initial learning rate was set to $1 \times 10^{-3}$, and a clipped cosine annealing learning rate schedule was applied with the minimum learning rate set to $1 \times 10^{-4}$. Weight decay was set to $5 \times 10^{-4}$.
        The training was conducted for $70$ epochs with a mini-batch size of $16$ on an NVIDIA A40 GPU using PyTorch.
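
        As an illustration of the schedule, one plausible reading of \enquote{clipped cosine annealing} is a cosine decay towards zero over $T$ epochs that is bounded below by the minimum learning rate. This form is an assumption made for this sketch; the released code remains the reference for the exact schedule. With $\eta_0 = 1 \times 10^{-3}$, $\eta_{\min} = 1 \times 10^{-4}$, and $T = 70$,
        \begin{equation*}
            \eta_t = \max\Big( \eta_{\min},\; \frac{\eta_0}{2} \Big( 1 + \cos \frac{\pi t}{T} \Big) \Big),
        \end{equation*}
        so that $\eta_{35} = 5 \times 10^{-4}$ at mid-training. Under this reading, the lower bound becomes active once $\cos(\pi t / T) < -0.8$, i.e., from epoch $56$ onwards.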
        
\section{Baseline post-hoc methods}
    \label{suppl:cam-based-methods}
    SoftCAM was benchmarked against widely used post-hoc methods, spanning CAM-based and backpropagation-based approaches. 
    CAM-based methods produce coarse-grained explanations due to their reliance on low-resolution convolutional feature maps, whereas backpropagation-based techniques provide fine-grained pixel-level attributions that preserve the full spatial resolution.
    CAM-based methods include gradient-based approaches such as GradCAM and LayerCAM, as well as the gradient-free method ScoreCAM.
    Backpropagation-based techniques include Integrated Gradients and Guided Backpropagation. 
    Gradient-based methods primarily differ in how they aggregate gradients to compute importance weights, while gradient-free methods vary in how these weights are estimated without backpropagation.
    
    Guided Backpropagation and Integrated Gradients have consistently performed well in generating saliency maps to explain black-box CNN classifiers on retinal images \cite{ayhan2022clinical, djoumessi2024inherently}, while GradCAM has shown strong localization performance for chest X-ray interpretation \cite{saporta2022benchmarking}. Below is a brief description of the five post-hoc methods used.

    %\paragraph{CAM}
    
    \paragraph{ScoreCAM} \hspace{-0.25cm}\cite{wang2020score}. A gradient-free method that eliminates the need for gradient information by assessing the importance of each activation map based on its forward-pass contribution to the target class score, and produces the final output via a weighted sum of these maps.
    
    \paragraph{LayerCAM} \hspace{-0.25cm}\cite{jiang2021layercam}. A gradient-based method that generates class activation maps by leveraging the element-wise product of ReLU-activated gradients and feature maps at any convolutional layer, enabling fine-grained, spatially precise visual explanations without requiring global average pooling.

    \paragraph{GradCAM} \hspace{-0.25cm}\cite{selvaraju2017grad}. A gradient-based approach that uses the gradients of the target class flowing into the final convolutional layer to produce a coarse localization map, highlighting important regions in the image by upsampling the resulting map.

    \paragraph{Guided backpropagation (Guided BP)} \hspace{-0.25cm}\cite{springenberg2014striving}. A gradient-based approach that modifies the standard backpropagation process to propagate only positive gradients through positive activations, producing fine-grained visualizations that highlight features strongly activating specific neurons in relation to the target output.

    \paragraph{Integrated Gradients (Itgd Grad)} \hspace{-0.25cm}\cite{sundararajan2017axiomatic}. A gradient-based method that attributes model predictions to input features by computing the path integral of gradients along a straight-line path from a baseline to the actual input, yielding fine-grained explanations.

    ScoreCAM and LayerCAM were implemented with TorchCAM \cite{torcham2020}, while the other methods were implemented with Captum \cite{kokhlikyan2020captum}.


\section{Explainability metrics}
    \label{suppl:xai-metrics}
    For binary tasks, performance was evaluated using accuracy and AUC, while for multi-class tasks, accuracy and the quadratic Cohen's kappa score ($\kappa$) were used. The AUC measures class separability, whereas the kappa score captures the agreement beyond chance.
    
    Explainability was assessed using several quantitative metrics, including activation consistency \cite{donteu2023sparse},  top-k localization precision \cite{donteu2023sparse}, activation precision \cite{barnett2021case}, further extended to activation sensitivity, and faithfulness \cite{yeh2019fidelity}. Together, these metrics quantify both alignment with expert clinical knowledge (top-k localization, activation precision, and activation sensitivity) and alignment with the model's true decision-making process (faithfulness).

     %- fidelity test for LLM
    \subsection{Activation consistency}
        \label{suppl:activation-consistency}
        The activation consistency \cite{donteu2023sparse} measures how well local explanations (i.e., positive or negative activation in attribution maps) reflect the true disease or healthy labels across a dataset. Intuitively, an interpretable model should consistently show positive activation regions associated with pathology for disease samples, while producing minimal or negative activations for healthy samples. This metric therefore assesses whether explanation maps globally align with the semantic meaning implied by ground-truth labels.
        
        Following \citet{donteu2023sparse}, activation consistency is computed as the proportion of positive activations in the attribution maps of disease samples and the proportion of negative activations in the attribution maps of healthy samples, averaged over the test set. A higher score indicates more coherent and label-aligned explanations.

        Formally, let $\mathcal{D} = \mathcal{D}_{\text{disease}} \cup \mathcal{D}_{\text{healthy}}$ denote the test set, and let $A_i(x,y)$ be the attribution map of sample $i$. We define
        \begin{equation*}
            \mathbbm{1}_{+}(A_i) = \frac{1}{|A_i|} \sum_{(x,y)} \mathbbm{1}\big( A_i(x,y) > 0 \big), \,\,\,\, \mathbbm{1}_{-}(A_i) = \frac{1}{|A_i|} \sum_{(x,y)} \mathbbm{1}\big( A_i(x,y) < 0 \big),
        \end{equation*}
        which give the proportions of positive and negative activations in the attribution map, respectively, with $|A_i|$ the number of spatial locations. \textbf{Activation consistency} is then defined as 
        \begin{align}
            AC_{+} &= \frac{1}{|\mathcal{D}_\text{disease}|} \sum_{i \in \mathcal{D}_\text{disease}} \mathbbm{1}_{+}(A_i), \\
            AC_{-} &= \frac{1}{|\mathcal{D}_\text{healthy}|} \sum_{i \in \mathcal{D}_\text{healthy}} \mathbbm{1}_{-}(A_i),
        \end{align}
        where $AC_{+}$ measures how consistently explanations highlight pathological regions in disease samples, and $AC_{-}$ measures how consistently they suppress activations in healthy samples. 
        This provides a dataset-level assessment of whether local explanations globally support the model’s classification behavior.
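
        As a small worked example (the values are hypothetical and chosen only for illustration), consider a $2 \times 2$ attribution map of a disease sample,
        \begin{equation*}
            A_1 = \begin{pmatrix} 0.4 & 0 \\ -0.2 & 0.1 \end{pmatrix}, \qquad \mathbbm{1}_{+}(A_1) = \frac{2}{4} = 0.5, \qquad \mathbbm{1}_{-}(A_1) = \frac{1}{4} = 0.25 .
        \end{equation*}
        Because both inequalities are strict, the zero entry counts towards neither proportion, so the two values need not sum to one. If a second disease sample has $\mathbbm{1}_{+}(A_2) = 0.25$, then $AC_{+} = (0.5 + 0.25)/2 = 0.375$ over these two samples; $AC_{-}$ is computed in the same way, but over healthy samples only.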

    \subsection{Top-k localization precision}
        Top-k localization precision \cite{donteu2023sparse} measures how well an explanation map highlights clinically relevant regions that overlap with ground-truth annotations. Specifically, it computes the proportion of the top-k positively activated regions that coincide with annotated areas obtained from clinicians. 
        
        Given an explanation map, we first upsample it to the original input resolution and split it into non-overlapping patches of size $33 \times 33$. Each patch is assigned a saliency score equal to the average activation within that patch. The top-k ($k \in [1,30]$) most salient patches are then selected, and the metric computes the fraction of these patches that overlap with annotated ground-truth regions. This generalizes the \enquote{pointing game} metric \cite{zhang2018top}, which only considers the single most salient region (top-1), making it better suited for medical imaging tasks where disease-relevant features (e.g., retinal lesions or other pathological markers) are often spatially distributed across the image.

        Formally, let $S$ denote the upsampled explanation map and let the image be partitioned into $N$ non-overlapping patches $\{P_1, \ldots, P_N \}$. The average activation of patch $P_p$ is
        \begin{equation*}
            a_p = \frac{1}{|P_p|} \sum_{(x,y) \in P_p} S(x,y).
        \end{equation*}
        Define the indices of the top-k salient patches as $T_k = \arg \text{top}_k \{a_1, \ldots, a_N\}$. Let $G$ be the binary ground-truth annotation mask. The overlap between a patch and the annotation is defined by 
        \begin{equation*}
            \mathbbm{1}_\text{overlap}(P_p, G) = \begin{cases}
                1, & \text{if } \sum_{(x,y) \in P_p} G(x,y) > 0, \\
                0, & \text{otherwise.}
            \end{cases}
        \end{equation*}
        The \textbf{Top-k localization precision} is then defined as
        \begin{equation}
            \text{Top-k precision} (k) = \frac{1}{k} \sum_{p \in T_k} \mathbbm{1}_\text{overlap}(P_p, G).
        \end{equation}
        A higher score indicates that the explanation consistently highlights clinically meaningful regions identified by experts.
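
        For instance, with the hypothetical value $k = 5$, suppose that three of the five selected patches contain at least one annotated pixel. Then
        \begin{equation*}
            \text{Top-k precision}(5) = \frac{1}{5} (1 + 1 + 1 + 0 + 0) = 0.6 .
        \end{equation*}
        For $k = 1$, the score of each image is either $0$ or $1$, which mirrors the hit-or-miss outcome of the pointing game.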

    \subsection{Activation precision and activation sensitivity}
        \label{suppl:activation-prec-sens}
        Let $\mathcal{X} = \{\mathbf{X}_i\}_{i=1}^n$ denote a set of input images, $\mathcal{M} = \{\mathbf{M}_i\}_{i=1}^n$ their corresponding binary segmentation masks, and $\mathcal{S} = \{\mathbf{S}_i\}_{i=1}^n$ the explanation or saliency maps produced by a given method. 
        \textit{Activation precision} measures how much of the positive evidence highlighted by an explanation map falls within the annotated ground-truth region \cite{barnett2021case}. 

        Before computing the metric, saliency maps are preprocessed by thresholding negative values to zero, $\mathbf{S}_i^+ = \max(\mathbf{S}_i, 0)$, ensuring that only positive evidence contributes to the evaluation.
        For each sample, activation precision (AP) is defined as the proportion of positive activation mass contained inside the annotation mask:
        \begin{equation*}
            \mathbf{AP}_i = \frac{\sum_p \mathbf{S}_i^+ (p) \textbf{M}_i(p)}{\sum_p \mathbf{S}_i^+ (p) + \epsilon},
        \end{equation*}
        where $p$ indexes spatial locations and $\epsilon$ is a small constant preventing division by zero. The dataset-level activation precision is then obtained by averaging over all samples:
        \begin{equation}
            \textbf{AP} = \frac{1}{n} \sum_{i=1}^n \textbf{AP}_i .
        \end{equation}
        This metric captures how precisely the explanation signal aligns with expert annotations, providing a clinically meaningful measure of the quality of an explanation method.

        However, activation precision does not penalize false negatives (i.e., missed relevant regions). To address this, we introduce \textit{activation sensitivity}, which captures the completeness of the explanation by computing the fraction of annotated regions that are covered by the saliency map. 
        We define the per-sample activation sensitivity (AS) as the positive activation mass inside the annotation mask, normalized by the mask area:
        \begin{equation*}
            \mathbf{AS}_i = \frac{\sum_p \mathbf{S}_i^+ (p) \textbf{M}_i(p)}{\sum_p \mathbf{M}_i (p) + \epsilon}.
        \end{equation*}
        The dataset-level activation sensitivity is obtained by averaging over samples: 
        \begin{equation}
            \textbf{AS} = \frac{1}{n} \sum_{i=1}^n \textbf{AS}_i .
        \end{equation}
        Intuitively, $\textbf{AS}_i$ reflects how much of the annotated region the explanation covers, weighted by activation strength, with higher values meaning fewer missed regions.

        Unlike activation precision, activation sensitivity penalizes weak activations inside the annotated region. For example, if $M_i(p)=1$ and $0 < S_i(p) < 1$, the low activation contributes only minimally to the numerator, indicating reduced confidence in that clinically relevant area. 
        This makes activation sensitivity particularly valuable in settings where complete lesion coverage is essential.    
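
        As a toy illustration with hypothetical values (ignoring $\epsilon$ and assuming saliency values normalized to $[0,1]$), consider four pixels with $\mathbf{S}_i = (0.8, 0.2, 0.5, -0.3)$ and $\mathbf{M}_i = (1, 1, 0, 0)$, so that $\mathbf{S}_i^+ = (0.8, 0.2, 0.5, 0)$. Then
        \begin{equation*}
            \mathbf{AP}_i = \frac{0.8 + 0.2}{0.8 + 0.2 + 0.5} \approx 0.67, \qquad \mathbf{AS}_i = \frac{0.8 + 0.2}{1 + 1} = 0.5 .
        \end{equation*}
        The activation outside the mask ($0.5$) lowers precision but leaves sensitivity unchanged, whereas sensitivity is limited by the weak in-mask activation ($0.2$): both annotated pixels are positively activated, yet $\mathbf{AS}_i$ is only $0.5$.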

    \subsection{Faithfulness}
        Faithfulness, also referred to as sensitivity or fidelity \cite{yeh2019fidelity}, evaluates how well an explanation reflects the model’s true decision-making process. It assesses whether the importance assigned to input features corresponds to their actual influence on the model’s prediction.
        
        In our implementation, we evaluated faithfulness on correctly classified test samples. Each explanation map is upsampled to the input resolution and split into non-overlapping $33 \times 33$ patches, which are ranked by mean activation values. The top-k most salient patches are iteratively occluded, and after each step we record the relative drop in the model's confidence for the predicted class. 
        This produces a deletion curve from which we computed the Area Under the Deletion Curve (AUDC). A lower AUDC indicates a more faithful explanation, as it reflects a sharper decrease in confidence when the regions deemed important are removed, showing that these regions effectively contributed to the model's decision.
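
        One way to make this computation explicit is sketched below; the normalization by the initial confidence and the trapezoidal rule are assumptions made for illustration, not a description of the released implementation. Let $c_0$ be the model's confidence for the predicted class on the unmodified image, $c_j$ the confidence after occluding the $j$ most salient patches, and $\tilde{c}_j = c_j / c_0$. Then
        \begin{equation*}
            \text{AUDC} \approx \frac{1}{K} \sum_{j=1}^{K} \frac{\tilde{c}_{j-1} + \tilde{c}_j}{2} ,
        \end{equation*}
        where $K$ is the number of occlusion steps. For example, $\tilde{c} = (1, 0.6, 0.4)$ with $K = 2$ gives $\text{AUDC} = \tfrac{1}{2}(0.8 + 0.5) = 0.65$, whereas an explanation whose top patches leave the confidence unchanged, $\tilde{c} = (1, 1, 1)$, gives $\text{AUDC} = 1$.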

\section{Sparsity regularization selection for the binary tasks}
    \label{suppl:binary-lasso-reg}
    The sparsity regularization coefficient $\lambda_1$ in Eq.\,\ref{eq:loss} controls the sparsity of the class evidence map, encouraging the model to localize disease regions with high precision. For each task, $\lambda_1$ was selected on the corresponding validation set by considering both accuracy and AUC, choosing the highest value for which classification performance did not degrade significantly (Fig.\,\ref{fig:app-lasso-bin-reg}).
    \begin{figure}[H]
        \centering
        \includegraphics[width=.9\textwidth]{figures/Suppl/fig_lasso_bin_reg_all.pdf}
        \vspace{-0.4cm}
        \caption{\textbf{Model selection on validation sets under varying sparsity regularization strengths.} The regularization coefficient $\lambda_1$ influences model performance, with notable effects on some datasets but minimal impact on the OCT dataset. The red markers indicate the selected $\lambda_1$ values, chosen to balance sparsity and classification performance.}
        \label{fig:app-lasso-bin-reg}
    \end{figure}

\section{Additional results}
    \paragraph{Visualizing explanations.} The evidence map generated by SoftCAM is upsampled to the input resolution for visualization. As with most CAM-based methods that operate on the final convolutional layer, such as GradCAM, ScoreCAM, and LayerCAM, SoftCAM’s explanations are limited by the resolution of the backbone (e.g., $16 \times 16$ for VGG-16/ResNet-50 with a $512 \times 512$ input) due to pooling and striding, leading to lower-resolution saliency maps. 
    However, by introducing the class evidence and classification layer directly on top of the low-resolution features and applying regularization, SoftCAM improves spatial precision.
    In contrast, gradient-based methods like Integrated Gradients \cite{sundararajan2017axiomatic} and Guided Backpropagation \cite{springenberg2014striving} produce high-resolution saliency maps by computing pixel-level gradients, which may lead to noisy maps, especially when the region of interest spans a broader area, as commonly observed in Chest X-ray images.

    \paragraph{Comparison with other approaches.}
    Unlike post-hoc attribution-based approaches, SoftCAM is inherently interpretable from the classification layer and maintains performance comparable to its black-box counterpart, without a significant trade-off, even when regularization is applied to enhance explainability. 
    Compared to \citet{donteu2023sparse}, SoftCAM extends from interpretable bag-of-local-features models to general black-box CNN architectures and generalizes the regularization from lasso to ElasticNet, with extensive evaluations across multiple datasets using a broad range of explainability metrics.
    Compared to \citet{aubreville2019transferability}, our method is trained end-to-end and does not require post-hoc processing, weight sharing between branches, or an additional forward pass to generate explanations.
    
    \subsection{SoftCAM provides inherently interpretable visual explanations}
        \label{suppl:fig2-vgg-qualitative-binary}
        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig2_vgg_qualitative_visualization.pdf}
            \vspace{-0.6cm}
            \caption{\textbf{Explanations generated by different methods from VGG-16}. The first column shows disease images with reference annotations, indicated by green markers or bounding boxes. Each row, from top to bottom, corresponds to fundus, OCT, and Chest X-ray images, respectively. The next five columns present saliency maps generated by post-hoc explanation methods. The final two columns showcase our proposed inherently interpretable SoftCAM explanations. %\rev{Due to architectural constraints, the standard CAM method cannot be directly applied to standard VGG networks}
            }
            \label{fig2:app-vgg-qualitative-vis}
            \vspace{-0.5cm}
        \end{figure}

    \subsubsection*{Additional qualitative examples}
        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig2_qualitative_visualization_suppl_resnet.pdf}
            \vspace{-0.4cm}
            \caption{\textbf{Additional qualitative explanations from the ResNet model}.}
        \end{figure}

        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig2_qualitative_visualization_suppl_vgg.pdf}
            \caption{\textbf{Additional qualitative explanations from the VGG model}.}
        \end{figure}

    \subsection{Activation consistency}
        \label{suppl:activation-consistency-results}
        We quantified activation consistency only for the SoftCAM variants, as post-hoc methods are not inherently explainable, meaning their explanations do not directly influence the model’s decision-making process. 
        
        The results align well with qualitative visualizations. On the Fundus dataset, the sparse SoftCAM model exhibits a higher proportion of positive activations with the ResNet backbone, attributed to reduced false positives from the dense model, and fewer negative activations, reflecting the suppression of low-importance activations to zero. On the VGG backbone, regularization primarily reduces false-positive activations from the unregularized model but leads to a slight increase in activations on healthy samples. A similar result can be observed on the RSNA dataset.

        On the OCT dataset, the unregularized SoftCAM with the ResNet backbone generally produces coarse-grained evidence around lesion areas. In contrast, the sparse variant refines these explanations, resulting in lower positive and negative activations across both disease and healthy samples, suggesting more selective and focused localization. However, with the VGG backbone, a higher proportion of negative activations is observed, reflecting the impact of the regularization strength and highlighting the importance of appropriately tuning this parameter for different architectures.

        \begin{table}[H]
            \centering
            \caption{Activation consistency on the ResNet model. $\mathbf{r_{LG}^+}$ denotes the proportion of positive (disease) activations in disease images, i.e., $AC_{+}$ in Sec.\,\ref{suppl:activation-consistency}, while $\mathbf{r_{LG}^-}$ denotes the proportion of negative (healthy) activations in healthy images, i.e., $AC_{-}$.}
            \small %\scriptsize % scriptsize footnotesize
            \begin{tabular}{l|c|c|c|c|c|c}
                & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c}{RSNA} \\
                \hline
                & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$ & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$ & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$   \\
                SoftCAM & $0.28 \pm 0.1$ & $ 0.86 \pm 0.1$ & $0.30 \pm 0.1$ & $0.85 \pm 0.1$ & $0.75 \pm 0.1$ & $0.47 \pm 0.1$ \\ 
                sparse SoftCAM & $0.55 \pm 0.2$ & $0.76 \pm 0.2$ & $0.23 \pm 0.1$ & $0.83 \pm 0.1$ & $0.79 \pm 0.1$ & $ 0.45 \pm 0.1$ \\ 
                \hline
            \end{tabular}
        \end{table}
        
        \begin{table}[H]
            \centering
            \caption{Activation consistency on the VGG model. $\mathbf{r_{LG}^+}$ denotes the proportion of positive (disease) activations in disease images, i.e., $AC_{+}$ in Sec.\,\ref{suppl:activation-consistency}, while $\mathbf{r_{LG}^-}$ denotes the proportion of negative (healthy) activations in healthy images, i.e., $AC_{-}$.}
            %\scriptsize % scriptsize footnotesize
            \small
            \begin{tabular}{l|c|c|c|c|c|c}
                & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c}{RSNA} \\
                \hline
                & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$ & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$ & $\mathbf{r_{LG}^+} \uparrow$ & $\mathbf{r_{LG}^-} \uparrow$   \\
                SoftCAM & $ 0.32 \pm 0.2$ & $0.93 \pm 0.1$ & $0.75 \pm 0.11$ & $0.51 \pm 0.1$ & $0.75 \pm 0.1$ & $0.51 \pm 0.1$ \\ 
                sparse SoftCAM & $0.28 \pm 0.2$ & $0.94 \pm 0.1$ & $0.35 \pm 0.14$ & $0.95 \pm 0.1$ & $0.35 \pm 0.1$ & $0.95 \pm 0.1$ \\ 
                \hline
            \end{tabular}
             %\vspace{-0.4cm}
        \end{table}

        Overall, the effect of regularization on the explanations varies depending on the backbone architecture. Nevertheless, the activation consistency metric aligns well with the qualitative explanations, generally capturing the impact of regularization across the dataset for a given architecture.

    \subsection{Precision and sensitivity analysis}
        \label{suppl:precision-sensitivity-rsna}   
        We quantitatively evaluate the explanations generated by various methods using the ResNet and VGG backbones on the RSNA dataset. With the ResNet model, the unregularized SoftCAM achieves the highest localization precision, whereas the sparse SoftCAM yields the best results in terms of sensitivity. This discrepancy underscores the importance of developing evaluation metrics that balance human-aligned localization quality with model fidelity, capturing both interpretability and decision relevance.
        
        \begin{figure}[H]
            \centering
            \includegraphics[width=.82\textwidth]{figures/Suppl/fig3_rsna_quantitative_evaluation.pdf}
            \vspace{-0.2cm}
            \caption{\textbf{Precision vs. sensitivity analysis on the RSNA dataset}. Quantitative evaluation of explanations generated by different methods from the ResNet and VGG models on the RSNA dataset.}
            \label{fig3:app-rsna-quantitative-eval}
        \end{figure}

    \subsection{SoftCAM provides localized and faithful explanations}
        \label{suppl:prec-sens-eval}
        Alongside the visualization comparing SoftCAM variants with post-hoc methods, we provide the corresponding quantitative metrics, reporting both top-k precision and faithfulness.% (measured via the Area Under the Deleted Curve, AUDC). %for  $k=15$ and $k=30$.
        \vspace{-0.2cm}
        
        \begin{table}[H]
            \centering
            \caption{Top-15 localization precision and faithfulness. Faithfulness is quantified as the Area Under the Deletion Curve (AUDC), where lower values indicate greater faithfulness. For precision, higher values indicate better alignment between saliency maps and ground truth annotations. We refer to AUDC as \enquote{Del} and Top-K as \enquote{Top}.}
            \vspace{-0.15cm}
            \scriptsize
            \begin{tabular}{l|c|c|c|c|c|c|c|c|c|c|c|c}
                & \multicolumn{6}{c|}{ResNet (Top $\uparrow$, Del $\downarrow$)} & \multicolumn{6}{c}{VGG (Top $\uparrow$, Del $\downarrow$)}  \\
                \hline
                & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c|}{RSNA} & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c}{RSNA} \\
                %\hline
                & Top & Del & Top & Del & Top & Del & Top & Del & Top & Del & Top & Del  \\
                \hline
                ScoreCAM & $0.22$ & $0.77$ & $0.12$ & $0.84$ & $0.72$ & $0.98$ & $0.30$ & $0.78$ & $0.12$ & $0.76$ & $\bf{0.74}$ & $0.93$ \\
                LayerCAM & $0.25$ & $0.76$ & $0.11$ & $0.85$ & $0.75$ & $0.97$ & $0.30$ & $0.76$ & $0.11$ & $0.76$ & $0.72$ & $0.91$\\          
                GradCAM & $0.37$ & $0.75$ & $0.15$ & $0.84$ & $0.75$ & $0.97$ & $\bf{0.65}$ & $0.79$ & $0.15$ & $0.77$ & $0.72$ & $0.92$ \\ 
                Guided BP & $0.38$ & $\bf{0.69}$ & $0.36$ & $0.80$ & $0.60$ & $0.98$ & $0.48$ & $\bf{0.72}$ & $0.36$ & $\bf{0.65}$ & $0.66$ & $0.92$ \\ 
                Itgd Grad & $0.34$ & $0.73$ & $0.30$ & $0.82$ & $0.56$ & $0.98$ & $0.40$ & $0.75$ & $0.30$ & $0.77$ & $0.61$ & $0.93$ \\ 
                \hline
                SoftCAM & $0.40$ & $0.79$ & $0.46$ & $0.84$ & $0.74$ & $\bf{0.95}$ & $0.54$ & $0.73$ & $0.72$ & $0.68$ & $\bf{0.74}$ & $0.92$ \\ 
                sparse SoftCAM & $\bf{0.52}$ & $0.78$ & $\bf{0.86}$ & $\bf{0.57}$ & $\bf{0.78}$ & $\bf{0.95}$ & $\bf{0.65}$ & $0.78$ & $\bf{0.82}$ & $0.66$ & $0.71$ & $\bf{0.90}$ \\ 
                \hline
            \end{tabular}
        \end{table}
        
        \begin{table}[H]
            \centering
            \caption{Top-30 localization precision and sensitivity. Sensitivity is quantified as the Area Under the Deleted Curve (AUDC), where lower values indicate greater faithfulness, that is, a larger drop in the model's confidence when the most relevant patches are removed. For precision, higher values indicate better alignment between saliency maps and ground truth annotations.}
            \vspace{0.3cm}
            \scriptsize
            \begin{tabular}{l|c|c|c|c|c|c|c|c|c|c|c|c}
                & \multicolumn{6}{c|}{ResNet (Top $\uparrow$, Del $\downarrow$)} & \multicolumn{6}{c}{VGG (Top $\uparrow$, Del $\downarrow$)}  \\
                \hline
                & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c|}{RSNA} & \multicolumn{2}{c|}{Fundus} & \multicolumn{2}{c|}{OCT} & \multicolumn{2}{c}{RSNA} \\
                %\hline
                & Top & Del & Top & Del & Top & Del & Top & Del & Top & Del & Top & Del  \\
                \hline
                ScoreCAM & $0.16$ & $0.67$ & $0.07$ & $0.73$ & $0.66$ & $0.97$ & $0.20$ & $0.67$ & $0.08$ & $0.55$ & $0.62$ & $0.88$ \\
                LayerCAM & $0.22$ & $0.65$ & $0.08$ & $0.74$ & $0.65$ & $0.97$ & $0.23$ & $0.64$ & $0.08$ & $0.56$ & $\bf{0.65}$ & $0.84$\\          
                GradCAM & $0.37$ & $0.64$ & $0.14$ & $0.73$ & $0.67$ & $0.95$ & $\bf{0.65}$ & $0.68$ & $0.14$ & $0.58$ & $0.61$ & $0.86$\\ 
                Guided BP & $0.30$ & $\bf{0.57}$ & $0.23$ & $0.68$ & $0.55$ & $0.97$ & $0.43$ & $\bf{0.57}$ & $0.23$ & $\bf{0.40}$ & $0.58$ & $0.85$\\ 
                Itgd Grad & $0.28$ & $0.63$ & $0.20$ & $0.70$ & $0.52$ & $0.98$ & $0.33$ & $0.62$ & $0.20$ & $0.51$ & $0.55$ & $0.88$\\ 
                \hline
                SoftCAM & $0.39$ & $0.69$ & $0.46$ & $0.61$ & $0.65$ & $\bf{0.92}$ & $0.54$ & $0.63$ & $0.72$ & $0.45$ & $0.64$ & $0.84$\\ 
                sparse SoftCAM & $\bf{0.52}$ & $0.68$ & $\bf{0.86}$ & $\bf{0.31}$ & $\bf{0.73}$ & $0.93$ & $\bf{0.65}$ & $0.67$ & $\bf{0.82}$ & $0.43$ & $0.63$ & $\bf{0.82}$\\ 
                \hline
            \end{tabular}
        \end{table}
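
        As a reading aid, the top-$k$ precision reported in the two tables above can be sketched as follows. This is an assumed formalization introduced here for illustration, and the exact definition used to produce the tables may differ: $P_k$ denotes the set of the $k$ patches with the highest saliency and $G$ the set of patches that overlap a ground-truth annotation.
        \[
            \mathrm{Top}_k = \frac{|P_k \cap G|}{k}
        \]
        Under this reading, a value of $0.86$ at $k=15$ would mean that roughly 13 of the 15 highest-ranked patches fall on annotated regions.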

\section{Activation precision and sensitivity on the RSNA dataset}

    \subsection{Lasso vs Ridge penalty}
        \label{suppl:ridge-penalty}
        \vspace{-0.4cm}
        \begin{figure}[H]
            \centering
            \includegraphics[width=.83\textwidth]{figures/Suppl/fig_lasso_ridge_rsna.pdf}
            \vspace{-0.3cm}
            \caption{\textbf{Model selection on validation sets under varying regularization strengths.} The regularization strengths $\lambda_1$ (lasso) and $\lambda_2$ (ridge) influence model performance. The red markers indicate the selected regularization values, chosen to balance classification performance.}
            \label{fig:app-lasso-ridge-penalities}
            \vspace{-0.45cm}
        \end{figure}
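
        The two penalties compared in Fig.\,\ref{fig:app-lasso-ridge-penalities} can be summarized in a single objective. The following is a sketch under the assumption that both penalties act on the class evidence maps; the symbols $\mathcal{L}_{\mathrm{cls}}$ (classification loss) and $A_c$ (evidence map of class $c$) are notation introduced here for illustration and are not taken from the main text.
        \[
            \mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \sum_{c} \| A_c \|_1 + \lambda_2 \sum_{c} \| A_c \|_2^2
        \]
        In this notation, $\lambda_2 = 0$ with $\lambda_1 > 0$ corresponds to the lasso (sparse) variant, $\lambda_1 = 0$ with $\lambda_2 > 0$ to the ridge variant, and $\lambda_1 = \lambda_2 = 0$ to the unregularized SoftCAM.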

    \subsection{Activation precision vs. activation sensitivity}
        \label{suppl:act-prec-sens-tab}
        \vspace{-0.6cm}
        \begin{table}[H]
            \centering
            \caption{Activation Precision (AP) vs. Activation Sensitivity (AS) for different SoftCAM variants and baseline post-hoc methods. The base SoftCAM consistently lies between the sparse (lasso) and ridge variants, highlighting the importance of balancing the $\ell_1$ and $\ell_2$ penalties to achieve an optimal trade-off between precision and completeness.}
            \small %footnotesize scriptsize
            \begin{tabular}{l|c|c|c|c}
                & \multicolumn{2}{c|}{ResNet} & \multicolumn{2}{c}{VGG} \\
                %\hline
                & AP $\uparrow$ & AS $\uparrow$ & AP $\uparrow$ & AS $\uparrow$  \\
                \hline
                ScoreCAM & $0.470$ & $0.318$ & $0.403$ & $0.303$  \\
                LayerCAM & $0.456$ & $0.300$ & $0.401$ & $0.120$  \\
                GradCAM & $0.525$ & $0.252$ & $0.373$ & $0.260$  \\
                Guided BP & $0.381$ & $0.033$ & $0.364$ & $0.044$  \\
                Itgd Grad & $0.286$ & $0.040$ & $0.322$ & $0.039$  \\
                \hline
                SoftCAM & $0.526$ & $0.251$ & $0.461$ & $0.355$  \\
                ridge SoftCAM & $0.440$ & $\bf{0.316}$ & $0.412$ & $\bf{0.396}$  \\
                sparse SoftCAM & $\bf{0.654}$ & $0.182$ & $\bf{0.519}$ & $0.320$  \\
                \hline
            \end{tabular}
        \end{table}

    \subsection{More examples: activation precision vs activation sensitivity}
        \label{suppl:act-prec-sens-visualization}
        \vspace{-0.2cm}
        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig4_rsna_saliency_v2.pdf}
            \vspace{-0.6cm}
            \caption{\textbf{Localization precision and sensitivity for pneumonia detection}. Each column shows explanations generated by different methods. Ground-truth bounding boxes are drawn on each map, with the top-right value indicating the activation precision, while the top-left value indicates the activation sensitivity. This shows the trade-off between ridge and lasso regularization. %The high precision from the lasso model and the more complete explanations from the ridge model emphasize the importance of balancing the two regularization strengths to achieve an optimal trade-off.
            }
            \vspace{-0.65cm}
        \end{figure}
    
 \section{Multi-class analysis}
    \label{multi-class-analysis}
     We extended our method to the multi-class setting for retinal disease diagnosis, using the same training setup as in the binary tasks. The only modifications were setting the number of output classes in the evidence layer to 5 for DR grading (fundus dataset) and 4 for retinal disease classification (OCT dataset), and selecting appropriate lasso penalties (e.g., $\lambda_1 = 9 \cdot 10^{-4}$ for ResNet and $\lambda_1 = 3 \cdot 10^{-6}$ for VGG on the OCT dataset).
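
     To make the change in the evidence layer concrete, the following is a minimal sketch. It assumes that the evidence layer is a convolution producing one evidence map per class and that each map is averaged spatially to obtain the class logit; the symbols $F$ (backbone feature map), $w_c$, $b_c$, $A_c$ and $z_c$ are introduced here for illustration only.
     \[
         A_c = w_c * F + b_c, \qquad z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} A_c(i,j), \qquad c = 1, \dots, C
     \]
     In this notation, moving from the binary to the multi-class setting only changes $C$, with $C = 5$ for DR grading and $C = 4$ for OCT-based retinal disease classification.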
     
    \subsection{Regularization}
        \label{suppl:multi-class-reg}
        Given the small size of retinal lesions, we used Lasso regularization, selecting $\lambda_1$ values that balanced classification performance (Fig.\,\ref{fig:app-lasso-multi-penalities}).

        \begin{figure}[H]
            \centering
            \includegraphics[width=0.85\textwidth]{figures/Suppl/fig_lasso_multi_reg_all.pdf}
            \vspace{-0.1cm}
            \caption{\textbf{Model selection on validation sets under varying sparsity penalties.} The regularization coefficient $\lambda_1$ influences model performance. The red markers indicate the selected regularization values, chosen to balance classification performance.}
            \label{fig:app-lasso-multi-penalities}
        \end{figure}

    \subsection{Area Under the Deleted Curve}
        \label{suppl:multi-AUDC}
        The relative area under the deletion curve (AUDC) was computed from the sensitivity analysis by occluding the top-30 patches ranked by importance in the explanation map (Fig.\,\ref{fig5:multi-class-sensitivity}).
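        Sparse SoftCAM yielded the lowest AUDC, indicating the highest faithfulness, in three of the four backbone and dataset combinations; the exception was VGG on the fundus dataset, where LayerCAM was lowest (Tab.\,\ref{tab:AUDC}). Dense SoftCAM did not consistently improve on the post-hoc methods.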
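
        The computation described above can be sketched as follows. The notation is introduced here for illustration and is an assumption, not necessarily the exact definition used for Tab.\,\ref{tab:AUDC}: $f_y(x)$ is the model's confidence for the predicted class $y$ on image $x$, and $x^{(k)}$ is the image with its $k$ highest-ranked patches occluded.
        \[
            \mathrm{AUDC} = \frac{1}{K} \sum_{k=1}^{K} \frac{f_y\bigl(x^{(k)}\bigr)}{f_y(x)}, \qquad K = 30
        \]
        Under this reading, an explanation that ranks decision-relevant patches first makes the ratio fall quickly, which gives a small AUDC.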
        
        
        \begin{table}[H]
            \centering
            \caption{Area Under the Deleted Curve (AUDC $\downarrow$).}
            %\scriptsize % scriptsize footnotesize
            \vspace{4mm}
            \label{tab:AUDC}
            \begin{tabular}{l|c|c|c|c}
                & \multicolumn{2}{c|}{ResNet} & \multicolumn{2}{c}{VGG} \\
                %\hline
                & Fundus & OCT & Fundus & OCT  \\
                \hline
                ScoreCAM & $0.894$ & $0.819$ & $0.880$ & $0.852$  \\
                LayerCAM & $0.889$ & $0.817$ & $\bf{0.869}$ & $0.850$  \\
                GradCAM & $0.887$ & $0.815$ & $0.872$ & $0.847$  \\
                Guided BP & $0.899$ & $0.793$ & $0.905$ & $0.823$  \\
                Itgd Grad & $0.907$ & $0.821$ & $0.901$ & $0.833$  \\
                \hline
                dense SoftCAM & $0.870$ & $0.825$ & $0.905$ & $0.826$  \\
                sparse SoftCAM & $\bf{0.856}$ & $\bf{0.609}$ & $0.882$ & $\bf{0.806}$  \\
                \hline
            \end{tabular}
        \end{table}

    \subsection{Qualitative explanation on retinal fundus images}
        \label{suppl:multi-visualization}
        For the multi-class DR detection tasks on fundus images, SoftCAM variants produced more focused and class-consistent explanations. Alongside the sparse and unregularized evidence maps, we also include visualizations from post-hoc methods.
        
        \newpage

        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig5_multi_class_res_fundus_visualization.pdf}
            \vspace{-0.3cm}
            \caption{\textbf{Class-specific explanation with the ResNet backbone}. The application of our method to multi-class DR detection demonstrates the utility of class-specific explanations produced by the sparse SoftCAM, which more precisely highlight disease-relevant regions compared to the dense SoftCAM and the best-performing post-hoc method, GradCAM. In the example shown, the image is labeled as severe DR, and the highlighted regions correspond to suspicious areas, reflecting relevant DR lesions.}
            \label{suppl:fig5-multi-visua}
        \end{figure}

        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig5_multi_class_vgg_fundus_visualization.pdf}
            \vspace{-0.4cm}
            \caption{\textbf{Class-specific explanation with the VGG backbone}. }
            \label{suppl:fig5-multi-visua-vgg-fundus}
            %\vspace{-0.3cm}
        \end{figure}

    \subsection{Qualitative explanation on retinal OCT images}
        \label{suppl:multi-oct-visualization}
        For multi-class OCT-based retinal disease classification, SoftCAM variants yielded more focused and class-consistent explanations. We also provide visualizations from other methods. Note that GradCAM and Guided BP were the best-performing post-hoc methods for ResNet and VGG according to the Area Under the Deletion Curve (Tab.\,\ref{tab:AUDC}).
        \vspace{-0.5cm}
        % OCT ResNet
        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig5_multi_class_res_oct_visualization.pdf}
            \vspace{-0.65cm}
            \caption{\textbf{Class-specific explanation with the ResNet backbone}. Sparse SoftCAM provides more precise class-specific localization than unregularized SoftCAM and leading post-hoc methods, highlighting clinically relevant regions.}
            \label{suppl:fig5-multi-visua-res-oct}
        \end{figure}

        % OCT VGG
        \begin{figure}[H]
            \centering
            \includegraphics[width=\textwidth]{figures/Suppl/fig5_multi_class_vgg_oct_visualization.pdf}
            \vspace{-4mm}
            \caption{\textbf{Class-specific explanation with the VGG backbone}.}
            \label{suppl:fig5-multi-visua-vgg-oct}
        \end{figure}
        