\documentclass{midl} % Include author names

\usepackage{mwe} 
\usepackage{csquotes}
\usepackage{float}
\usepackage{soul}
\usepackage{bbm}

\usepackage{xcolor}
\newcommand{\rev}[1]{\textcolor{purple}{#1}}

%\jmlrvolume{-- Under Review}
\jmlrvolume{-- 39}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026 submission}
%\editors{Under Review for MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[SoftCAM: Making black box CNN models self-explainable for medical image analysis]{SoftCAM: Making black box models self-explainable for medical image analysis} %high-stakes decisions

%\footnotetext[1]{Contributed equally}

\midlauthor{\Name{Kerol Djoumessi\nametag{$^{1}$}} \orcid{0009-0004-1548-9758} \Email{kerol.djoumessi-donteu@uni-tuebingen.de}\\
\addr $^{1}$ Hertie Institute for AI in Brain Health, University of T\unexpanded{\"u}bingen, Germany \\
\AND
\Name{Philipp Berens\nametag{$^{1,2}$}} \orcid{0000-0002-0199-4727} \Email{philipp.berens@uni-tuebingen.de}\\
\addr $^{2}$ T\unexpanded{\"u}bingen AI Center, University of T\unexpanded{\"u}bingen, Germany
}

\begin{document}

\maketitle
\begin{abstract}
    Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications.
    In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model's predictions.
    Evaluated on three medical datasets spanning three imaging modalities, SoftCAM maintains classification performance while significantly improving explanation quality, both qualitatively and quantitatively, compared to existing post-hoc methods. 
    The code is available at \url{https://github.com/kdjoumessi/SoftCAM}.
\end{abstract}

\begin{keywords}
    ElasticNet, Self-explainable models, Convolutional Neural Networks, Class Activation Maps, Attribution maps.
\end{keywords}

\section{Introduction}
    \label{sec:intro}
    Convolutional Neural Networks (CNNs) have revolutionized computer vision by effectively capturing local spatial patterns, reducing the number of parameters, and enabling faster convergence, leading to state-of-the-art performance in tasks such as image recognition and object detection \cite{oquab2015object, li2021survey}. However, their lack of interpretability limits adoption in high-stakes fields such as medical image analysis, where transparency and trust are essential. 
    To mitigate this issue, numerous post-hoc saliency- or attribution-based methods have been proposed to explain CNN predictions by highlighting \enquote{where} important features appear in the input through pixel-wise importance maps computed after training.     
    Prominent approaches include class activation map (CAM) techniques \cite{zhou2016learning, selvaraju2017grad, he2022survey}, which combine convolutional features with class-specific weights; backpropagation-based methods \cite{springenberg2014striving, sundararajan2017axiomatic}, which propagate output gradients to the input to estimate pixel relevance; and perturbation-based methods \cite{wang2020score, ivanovs2021perturbation}, which systematically alter input features and measure the corresponding change in the model's output.
    
    While post-hoc attribution-based methods provide intuitive visualizations of discriminative regions, they frequently approximate rather than accurately reflect the model's true reasoning \cite{adebayo2018sanity, saporta2022benchmarking} and are often valid only under strong regularity assumptions \cite{gunther2025informative}.   
    Their low faithfulness, reliability, and consistency limit their usefulness, particularly in clinical applications \cite{arun2021assessing}.   
    Moreover, post-hoc techniques often struggle to precisely localize disease-relevant regions in medical images \cite{arun2021assessing}, where the scarcity of ground-truth annotations further complicates validation. 
    To address these challenges, inherently or self-explainable models have been proposed \cite{rudin2019stop}, embedding interpretability directly into their architectures \cite{brendel2018bagnets, chen2019looks, koh2020concept, djoumessi2024actually}. By coupling prediction with explanation, self-explainable models provide interpretable and transparent insights, in line with human reasoning.
    However, their reliance on specialized architectures limits generalization to widely used CNN models.

    Motivated by these limitations, we introduce SoftCAM, a simple yet effective framework that makes CNNs inherently interpretable without relying on post-hoc explanation methods. SoftCAM generalizes the concept of class activation maps to transform conventional CNNs into self-explainable models. 
    By removing the final global average pooling layer and replacing the fully connected classifier with a convolution-based class-evidence layer, SoftCAM converts standard CNNs into fully convolutional architectures that generate explicit, class-specific evidence maps used both for prediction and visual explanation. 
    In experiments with two widely used CNN architectures, we show that the resulting SoftCAM-based models maintain competitive accuracy relative to their black-box baselines while achieving superior interpretability across three clinically relevant medical imaging datasets spanning three modalities.
    Furthermore, applying ElasticNet regularization, which combines ridge and lasso penalties, to the evidence maps improves explanations both qualitatively and quantitatively, revealing a task-dependent trade-off between sparsity and density. 
    In addition, we introduce a novel explainability metric, \emph{activation sensitivity}, which penalizes false negatives and weak activations within expert-annotated regions.
    Finally, a comprehensive evaluation against six widely used post-hoc attribution methods demonstrates that SoftCAM generally outperforms these approaches across explainability metrics.

\section{Method}
    \label{method}
    \paragraph{Preliminaries} Given an input image $\mathbf{X} \in \mathbb{R}^{H_X \times W_X \times C_X}$ with height $H_X$, width $W_X$, and $C_X$ channels, consider a CNN $f_\theta$ that maps $\mathbf{X}$ to a probability distribution over $C$ classes, $\mathbf{\hat{y}} = f_\theta(\mathbf{X}) \in \mathbb{R}^C$, where $\hat{y}^c$ denotes the predicted probability for class $c$. 
    The network consists of a feature extractor $g_\phi$ and a classifier layer $h_\psi$, with learnable parameters $\phi$ and $\psi$, respectively. 
    The feature extractor produces a feature map $\mathbf{Z} = g_\phi(\mathbf{X}) \in \mathbb{R}^{N \times M \times D}$, where $N \times M$ is the spatial resolution and $D$ is the feature dimension (e.g., $D=2048$ for standard ResNet variants). The classifier then generates the final prediction based on $\mathbf{Z}$. 
    Let $\mathcal{A} = \{\mathbf{A}_k\}_{k=1}^D$ denote the set of activation maps from the feature extractor, with $\mathbf{A}_k \in \mathbb{R}^{N \times M}$ representing the $k$-th channel. The 2D low-resolution saliency map $S^{c}_{\text{Map}} \in \mathbb{R}^{N \times M}$ provides a visual explanation of the model's prediction for class $c$. 
    This work focuses on training self-explainable CNN classifiers that simultaneously produce both the class prediction $\hat{y}^c$ and its corresponding explanation $S^{c}_{\text{Map}}$.
    
    In contrast, traditional CNNs are black boxes: a global average pooling layer reduces $\mathbf{Z}$ to a vector of size $1 \times D$, which is then fed into one or more linear fully connected layers (FCL) to generate the final prediction. Post-hoc methods are then required to explain their decisions.

    \subsection{CAM-based methods}
        Class Activation Maps (CAM) \cite{zhou2016learning} are closely related to our approach, providing visual explanations of CNN predictions through class-specific saliency maps. CAM operates by linearly combining the final convolutional feature maps with their corresponding importance weights from the FCL classifier to produce class-wise attribution maps:        
        \begin{equation}
            \label{cam}
            S_{\text{CAM}}^{c}(x_1, x_2) = \sum_{k=1}^D w_k^c A_k (x_1, x_2),
        \end{equation}         
        where $A_k (x_1, x_2)$ denotes the activation of the $k$-th feature map at location $(x_1, x_2)$, and $w_k^c$ is the class-specific importance weight for the $k$-th feature map from the fully connected layer.
    
        Originally, CAM was designed for CNNs with a global average pooling (GAP) layer followed by a single FCL. Subsequent extensions, such as GradCAM \cite{selvaraju2017grad} and LayerCAM \cite{jiang2021layercam}, use gradients to compute the importance weights, enabling class-specific explanations for various architectures, particularly those with multiple FCLs after the GAP layer, such as VGG \cite{simonyan2014very}. In GradCAM, the importance weights are computed by globally averaging the gradients of the target class score
        as $w_k^c = \frac{1}{N \times M} \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\partial y^c}{\partial A_k (i,j)}$, where $A_k(i,j)$ represents the activation at spatial location $(i,j)$ in the $k$-th feature map. Following GradCAM, several gradient-based and gradient-free extensions have emerged \cite{he2022survey}. 
        While gradient-based methods differ primarily in how they aggregate gradients to compute importance weights, gradient-free approaches compute weights without backpropagation, often relying on perturbation-based methods like ScoreCAM \cite{wang2020score}.
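
        As a brief worked check of how the CAM and GradCAM weights relate (a sketch, assuming a GAP layer followed by a single linear layer with a class bias $b^c$, a symbol introduced here only for illustration), the class score is
        \[
            y^c = \sum_{k=1}^{D} w_k^c \, \frac{1}{N M} \sum_{i=1}^{N} \sum_{j=1}^{M} A_k(i,j) + b^c,
            \qquad \text{so} \qquad
            \frac{\partial y^c}{\partial A_k(i,j)} = \frac{w_k^c}{N M}.
        \]
        Averaging this constant gradient over the $N \times M$ locations returns $w_k^c / (N M)$. Under these assumptions, the GradCAM weights therefore equal the CAM weights of Eq.\,\ref{cam} up to the positive factor $1/(NM)$, and the two maps differ only by this scaling and by the ReLU that GradCAM applies to the weighted sum.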
    
        Despite their widespread clinical use \cite{ayhan2022clinical}, attribution-based methods have notable limitations \cite{gunther2025informative}: they provide post-hoc explanations that may not reflect the model's true reasoning (Appendix \ref{suppl:related-work}). 
        Gradient-based variants can suffer from gradient saturation and false confidence \cite{wang2020score}, while gradient-free approaches are computationally costly, requiring many forward passes on perturbed inputs.

    \subsection{Generalizing CAMs for self-explainability} 
        \label{extendingCAM}
        Motivated by the limitations of post-hoc class activation map-based methods in interpreting CNNs, we introduce SoftCAM (Fig.\,\ref{fig:architecture}), a straightforward modification of black-box CNN classifiers that makes them inherently interpretable. 
        SoftCAM achieves this by replacing the fully connected classification layer in classical CNNs with an explicit class-evidence convolutional layer, preserving spatial information and providing explanations in a single forward pass, which eliminates the need for post-hoc techniques and their computational overhead.
        
        \begin{figure}[t]
            \centering
            \includegraphics[width=0.8\textwidth]{figures/fig1_architecture.pdf}
            \caption{\textbf{Overview of SoftCAM architecture.} (\textbf{a}) Input image. (\textbf{b}) The CNN backbone consists of all layers before the global average pooling layer. (\textbf{c}) Feature map generated by the backbone. (\textbf{d}) Classifier module with $C$ convolutional kernels of size $1 \times 1$. (\textbf{e}) Self-explainable class activation maps $\mathbf{A}$, obtained from the classifier with ElasticNet penalty applied to it to enhance interpretability. (\textbf{f}) Final predictions are derived from the evidence maps via spatial average pooling followed by the softmax function. Class-specific evidence maps (\textbf{g}) are upsampled and overlaid on the input to visualize the model's decision-making process.}
            \label{fig:architecture}
            %\vspace{-0.15cm}
        \end{figure}

        We make black-box CNNs self-explainable by modifying how predictions are obtained. Any FCL of size $b_1 \times b_2$ can be equivalently expressed as a $1 \times 1$ convolutional layer with $b_1$ input channels and $b_2$ output channels \cite{donteu2023sparse}.
        This allows systematic replacement of FCL classifier heads in CNN classifiers with convolutional layers, eliminating the GAP layer before classification while preserving model complexity and spatial localization. 
        The new classifier module $h$ consists of convolutional layers (Fig.\,\ref{fig:architecture}d) with $C$ convolution kernels of size $1 \times 1$ and unit stride, producing class-specific evidence maps (Fig.\,\ref{fig:architecture}e):
        \begin{equation}
            \mathbf{A} = h_{\psi}(\mathbf{Z}) \in \mathbb{R}^{N \times M \times C},
        \end{equation}
        where $\psi$ denotes the learnable parameters. These maps can be upsampled to the input resolution and overlaid on the image (Fig.\,\ref{fig:architecture}g) for visualization and clinical interpretation.
        The module $h_\psi$ can be viewed as a \emph{soft}, explainable generalization of classical post-hoc attribution methods (Eq.\,\ref{cam}), mapping the high-dimensional feature map $\mathbf{Z}$ with $D$ channels into low-dimensional, class-wise activation maps $\mathbf{A}$ with $C$ channels corresponding to the target classes. 
        Unlike CAM (Eq.\,\ref{cam}) and related attribution methods \cite{he2022survey}, our approach leverages the final feature map of the backbone and applies a parameterized function $h_\psi$ to produce class activation maps that directly support prediction.
        Instead of explicitly defining importance weights, these are implicitly learned and encoded within the classifier’s parameters, enabling \emph{soft} generation of class activation maps jointly with prediction during training.

        The resulting architecture is a fully convolutional, self-explainable model, where the predicted probabilities are derived directly from the evidence maps (Fig.\,\ref{fig:architecture}e), maintaining the model complexity without adding extra learnable parameters:
        \begin{equation}
            \mathbf{\hat{y}} = \text{Softmax} \bigg( \mbox{AvgPool} \Big( h_\psi \big( g_\phi (\mathbf{X}) \big) \Big) \bigg) \in \mathbb{R}^{1 \times C}.
        \end{equation}
        Furthermore, the class evidence maps $\mathbf{A}$ serve as built-in explanations, directly representing the contribution of individual input regions to the final prediction (Fig.\,\ref{fig:architecture}g). 
        Using class-evidence maps for classification has a useful property: all spatial locations of an evidence map are weighted equally when computing class probabilities (Fig.\,\ref{fig:architecture}f). 
        Consequently, input feature patches with high activations in the evidence maps contribute more strongly to the prediction, analogous to linear models, where each input feature contributes linearly to the output. 
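
        To make the relation to a standard classification head concrete, consider the simplest case of a single $1 \times 1$ convolution without nonlinearity (a sketch under this assumption; $w_k^c$ and $b^c$ denote the kernel weights and bias, and $\mathbf{Z}_k^{ij}$ the $k$-th feature channel at location $(i,j)$, notation introduced here for illustration). The evidence map and the pooled class score are then
        \[
            \mathbf{A}_c^{ij} = \sum_{k=1}^{D} w_k^c \, \mathbf{Z}_k^{ij} + b^c,
            \qquad
            \frac{1}{N M} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{A}_c^{ij}
            = \sum_{k=1}^{D} w_k^c \left( \frac{1}{N M} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{Z}_k^{ij} \right) + b^c .
        \]
        The right-hand side is a linear layer applied to the globally averaged features: averaging and a $1 \times 1$ convolution are both linear and therefore commute. Both heads have $D \cdot C + C$ parameters; for example, $D = 2048$ and $C = 2$ give $4098$ parameters in either form. This identity does not hold in general once nonlinearities are placed between several convolutional layers in the head.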

    \subsection{Regularizing SoftCAM for interpretability}
         By using explicit class-evidence maps, SoftCAM-based models can be trained with regularization constraints directly applied to the explanation maps, enhancing interpretability. In practice, we apply an ElasticNet penalty \cite{zou2005regularization}, which linearly combines the $\ell_1$ (lasso) and $\ell_2$ (ridge) regularization, leading to the following loss function: 
        \begin{equation}
            \mathcal{L}(\mathbf{y},\mathbf{\hat{y}}) = \mbox{CE}(\mathbf{y}, \mathbf{\hat{y}}) + \lambda_1 \sum_{i,j,c} |\mathbf{A}_c^{ij}| + \lambda_2 \sum_{i,j,c} \big(\mathbf{A}_c^{ij}\big)^2.
            \label{eq:loss}
        \end{equation}
        Here, CE denotes the cross-entropy loss, and $\mathbf{y}$ represents the reference labels. Setting $\lambda_2=0$ results in the lasso penalty, which promotes sparsity in the evidence maps \cite{donteu2023sparse} by suppressing less informative activations (mainly false positives), making it particularly useful for tasks where precision in explanations is crucial. In contrast, setting $\lambda_1=0$ gives the ridge penalty, which smooths activations without forcing them to zero. This is useful for assessing localization sensitivity over large regions, as it penalizes false negatives (Appendix \ref{suppl:xai-metrics}).
        ElasticNet thus offers a flexible balance between lasso ($\lambda_2 = 0$) and ridge ($\lambda_1 = 0$) penalties, which can be chosen according to the task and the explainability metric being optimized.
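
        The different effects of the two penalties can be seen from their derivatives with respect to a single evidence value $a = \mathbf{A}_c^{ij}$ (a sketch, assuming the ridge term is the squared magnitude $a^2$, as in the standard ElasticNet formulation):
        \[
            \frac{\partial}{\partial a} \, \lambda_1 |a| = \lambda_1 \, \mathrm{sign}(a) \quad (a \neq 0),
            \qquad
            \frac{\partial}{\partial a} \, \lambda_2 a^2 = 2 \lambda_2 a .
        \]
        The lasso term pulls every nonzero activation toward zero with a constant strength $\lambda_1$, regardless of its size, which is why weak activations tend to be suppressed entirely. The ridge pull is proportional to $a$ and vanishes as $a \to 0$, so large activations are shrunk most while small ones remain nonzero. For instance, at $a = 0.01$ the lasso gradient has magnitude $\lambda_1$, whereas the ridge gradient is only $0.02\,\lambda_2$.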

\section{Experimental setup}
    \label{experimental_setup}
    \paragraph{Datasets.} We evaluated our approach on three publicly available medical datasets spanning three imaging modalities: the Kaggle Diabetic Retinopathy (DR) \cite{kaggle_dr_detection}, Retinal OCT \cite{kermany2018identifying}, and RSNA Chest X-Ray (CXR) \cite{rsna_dataset} datasets. 
    The first dataset comprised high-resolution retinal color fundus images labeled with a DR severity score ranging from $0$ (No DR) to $4$ (Proliferative DR). 
    The second dataset included retinal OCT B-scan images labeled for three retinal disease conditions.
    The final dataset consisted of high-resolution frontal-view chest radiographs labeled for pneumonia detection, with bounding boxes for pneumonia cases. 
    Additionally, clinicians annotated lesions in $65$ DR images from the Kaggle dataset \cite{djoumessi2024inherently} and $40$ drusen lesions from the retinal OCT dataset \cite{djoumessi2024actually} to allow clinical evaluation.    
    Each dataset was split into training, validation, and test sets, ensuring that all samples from a given patient remained in the same split. For full details, see Appendix \ref{suppl:dataset}.
    
    \paragraph{Baseline models.} Our method was evaluated on two widely used black-box CNN architectures: ResNet-50 \cite{he2016deep} and VGG-16 \cite{simonyan2014very}. These models primarily differ in the design of their classification heads: ResNet uses a GAP layer followed by a single linear classifier, whereas VGG flattens the convolutional feature maps and employs multiple fully connected layers for classification.   
    More details on the training setup\footref{github}, including data preprocessing and augmentation, are provided in Appendix \ref{suppl:training_setup}.

    \paragraph{Post-hoc baseline.} SoftCAM was compared against six widely used post-hoc explanation methods (Appendix \ref{suppl:cam-based-methods}), including CAM-based gradient approaches such as GradCAM \cite{selvaraju2017grad} and LayerCAM \cite{jiang2021layercam}; CAM-based gradient-free methods such as the original CAM \cite{zhou2016learning} and  ScoreCAM \cite{wang2020score}; and backpropagation-based techniques such as Integrated Gradients (Itgd Grad) \cite{sundararajan2017axiomatic} and Guided Backpropagation (Guided BP) \cite{springenberg2014striving}. Each post-hoc method was evaluated on its respective black-box CNN models.

    \paragraph{Evaluation metrics.} The models were evaluated for predictive performance and explainability. For explainability, we used several quantitative metrics: \emph{top-k localization precision} \cite{donteu2023sparse}, \emph{activation precision} \cite{barnett2021case}, \emph{faithfulness} \cite{yeh2019fidelity}, and \emph{activation consistency} \cite{donteu2023sparse}. We further extended activation precision to define \emph{activation sensitivity}. 
    Together, these metrics quantify both alignment with expert clinician annotations and alignment with the model’s actual decision-making process. Full metric descriptions are provided in Appendix \ref{suppl:xai-metrics}. 

\section{Results}
    \label{results}
    \subsection{Making black box CNNs explainable maintains classification performance}
        \label{sec:bin_classification}
        % Kaggle (bin, multi): ResNet (5e5 vs 2e4), VGG (1e6, 7e6) 
          % Kaggle l2 (bin, multi l2): ResNet (1e5 vs -), VGG (1e5, -) 
       % OCT (bin, multi): ResNet (15e3 vs 9e4), VGG (14e3, 3e6)        
         %OCT (bin l2, multi l2): ResNet (1e5 vs -), VGG (1e5, -)
       % RSNA (l1, l2): ResNet (1e3 vs 7e5) VGG (8e5, 2e4)
       % mulit kaggle, OCT: 1e5
        \begin{table}[t]
            \centering
            \caption{Classification performance for disease detection on the test sets. SoftCAM variants of both CNNs are denoted by $^{SC}$, with $\ell_\lambda$ indicating the applied penalty.}
            %\vspace{-0.06cm}
            %\scriptsize % scriptsize footnotesize
            \small
            \begin{tabular}{l|c|c|c|c|c|c|c|c|c|c}
                   &  \multicolumn{4}{c|}{Kaggle Fundus}  & \multicolumn{4}{c|}{OCT retinal} & \multicolumn{2}{c}{RSNA CXR} \\
                   \hline
                 & \multicolumn{2}{c|}{Binary} & \multicolumn{2}{c|}{Multi-class} & \multicolumn{2}{c|}{Binary} & \multicolumn{2}{c|}{Multi-class} & \multicolumn{2}{c}{Binary} \\
                 &  Acc. & AUC & Acc. & $\kappa$ & Acc. & AUC & Acc. & $\kappa$ & Acc. & AUC \\
                 \hline
                 VGG-16 & $.907$ & $.938$ & $\textbf{.863}$ & $\textbf{.835}$ & $\textbf{.994}$ & $\textbf{1.0}$ & $\textbf{.967}$ & $\textbf{.955}$ & $.952$ & $.989$ \\
                 VGG-16$^{SC}$ & $\mathbf{.915}$ & $\textbf{.942}$ & $.861$ & $.834$ & $\textbf{.994}$ & $\textbf{1.0}$ & $.963$ & $.947$ & $\textbf{.957}$ & $\textbf{.999}$ \\
                 $\ell_1$-VGG-16$^{SC}$ & $.911$ & $.938$ & $.859$ & $.827$ & $.988$ & $.999$ &  $.947$ & $.929$ & $.953$ & $.990$ \\
                 $\ell_2$-VGG-16$^{SC}$ & $.910$ &  $.937$ & $.858$ & $.820$ & $.988$ & $.999$ & $.961$ & $.946$ & $.951$ & $.988$ \\
                 \hline
                 ResNet-50 & $.899$ & $.923$ & $.850$ & $.800$ & $.994$ & $.999$ &  $.970$ & $\textbf{.963}$ & $.953$ & $\textbf{.988}$ \\
                 ResNet-50$^{SC}$ & $.899$ & $\textbf{.926}$ & $.851$ & $\textbf{.811}$ & $.994$ & $\textbf{1.0}$ & $\textbf{.974}$ & $.960$ & $.942$ & $.983$ \\
                 $\ell_1$-ResNet-50$^{SC}$ & $.895$ & $.923$ &  $.851$ & $.801$ & $\textbf{.996}$ & $\textbf{1.0}$ & $.963$ & $.955$ & $.941$ & $.979$ \\
                 $\ell_2$-ResNet-50$^{SC}$ & $\textbf{.901}$ & $.924$ & $\textbf{.854}$ &  $.810$ & $.994$ & $.999$ & $.965$ & $.952$ & $\textbf{.958}$ & $.986$ \\
                 \hline
            \end{tabular}
            \label{tab1:classification}
            %\vspace{-0.35cm}
        \end{table}

        We first evaluated our method on clinically relevant binary classification tasks, including retinal disease classification from color fundus 
        (\{0\} vs. \{1-4\}) and OCT retinal images (Normal vs. Drusen), as well as pneumonia detection from chest X-rays, using accuracy and AUC as primary metrics.        
        For each CNN architecture, the $^\text{\enquote{SC}}$ model denotes our method without regularization ($\lambda_1 = \lambda_2 = 0$). 
        The \enquote{$\ell_1$} and \enquote{$\ell_2$} variants are obtained by applying either a lasso ($\lambda_2 = 0$) or a ridge penalty ($\lambda_1 = 0$), respectively, with task- and architecture-specific regularization strengths (e.g., $\lambda_1=1 \times 10^{-6}$ for VGG and $\lambda_1=5 \times 10^{-5}$ for ResNet on the fundus dataset).    
        The sparsity hyperparameters were selected to maintain strong classification performance (Appendix \ref{suppl:binary-lasso-reg}) while yielding qualitatively meaningful visual explanations on a subset of annotated images. 

       Our results show that SoftCAM-based models (Tab.\,\ref{tab1:classification}), with explicit class evidence maps, preserve classification performance comparable to their corresponding black-box baselines. Moreover, introducing the lasso and ridge regularizations on the class evidence map did not significantly degrade performance, as is sometimes observed when enforcing interpretability \cite{rudin2019stop}; in some cases, it even led to slight improvements, particularly in retinal disease classification. 
       These findings suggest that using convolutional layers in the classification head is an effective and promising approach for developing high-performing, self-explainable CNN models.

    \subsection{SoftCAM provides inherently interpretable visual explanations}   
        We qualitatively compared the evidence maps of the SoftCAM variants with the saliency maps generated by the six state-of-the-art CAM and back-propagation-based methods. Overall, our method produced more visually interpretable maps with high evidence regions centered on annotated lesions (Fig.\,\ref{fig2:qualitative-vis}). 
        We observed that the regions highlighted by the sparse SoftCAM models are typically a subset of those identified by both the unregularized and ridge SoftCAM variants, reflecting the sparsity constraint's effect in reducing irrelevant activations, while ridge regularization promotes denser activations. Additional results, including those for VGG-16 and other examples, are provided in Appendix \ref{suppl:fig2-vgg-qualitative-binary}. 

        On healthy images, the sparse SoftCAM evidence maps exhibited overall more negative activations, in contrast to the positive activations observed on disease images. To assess this quantitatively, we computed the activation consistency score \cite{donteu2023sparse}, calculating the proportion of positive and negative activations in disease and healthy samples from the test set. The resulting scores were consistent with the qualitative visualizations (e.g., sparse SoftCAM vs. SoftCAM on the fundus dataset using ResNet: $0.55$ vs. $0.27$ for the average proportion of positive activations on disease images). For a full analysis, see Appendix \ref{suppl:activation-consistency-results}.

        \label{sec:localization-faithfulness} 
        \begin{figure}[t]
            \centering
            \includegraphics[width=\textwidth]{figures/fig2_qualitative_visualization.pdf}
            %\vspace{-0.5cm}
            \caption{\textbf{Example explanations generated by different methods from ResNet-50}. The first column shows disease images with reference annotations (green markers or bounding boxes). The rows from top to bottom correspond to fundus, OCT, and Chest X-ray images, respectively. The next five columns present saliency maps from different post-hoc explanation methods. The last three columns show our proposed inherently interpretable SoftCAM--based explanations.}
            \label{fig2:qualitative-vis}
            %\vspace{-0.7cm}
        \end{figure}

    \subsection{SoftCAM provides localized and faithful explanations}   
        \label{sec:faithfulness}
        \begin{figure}[t]
            \centering
            \includegraphics[width=\textwidth]{figures/fig3_quantitative_evaluation.pdf}
            %\vspace{-0.5cm}
            \caption{\textbf{Quantitative evaluation of explanation methods}. 
            The first row shows the localization precision of saliency maps on fundus and OCT datasets.  
            The second row presents the sensitivity analysis assessing faithfulness.  
            Columns \textbf{a,b} show ResNet results, and \textbf{c,d} correspond to VGG. Higher precision indicates better localization; lower sensitivity reflects more reliable explanations.}
            \label{fig3:quantitative-analysis}
            %\vspace{-0.6cm}
        \end{figure}

        To quantitatively assess the explanations provided by our SoftCAM-based evidence maps compared to post-hoc saliency methods, we first evaluated their localization precision, which measures how consistently the highlighted regions in the explanation maps align with clinician-annotated disease findings. Following \cite{donteu2023sparse}, we computed the top-$k$ ($k=15$) localization precision by upsampling each explanation map to the input resolution, splitting it into non-overlapping $33 \times 33$ patches, and calculating the proportion of positively activated patches that overlap with ground truth annotations. 
        Although inherently interpretable, SoftCAM-based explanations performed competitively overall in terms of localization precision (Fig.\,\ref{fig3:quantitative-analysis}). 
        Notably, the sparse SoftCAM with the ResNet backbone achieved the highest top-$k$ precision among all methods (Appendix \ref{suppl:precision-sensitivity-rsna} and \ref{suppl:prec-sens-eval}), ranking second only in top-3 precision on the fundus dataset (Fig.\,\ref{fig3:quantitative-analysis}a), behind Guided BP, which benefits from high-resolution saliency maps.    
        Furthermore, the base and sparse SoftCAM typically achieved higher precision with fewer top-$k$ regions, so that their precision curves converged quickly, particularly on the fundus and OCT datasets. This suggests that sparse SoftCAM more consistently highlights fewer, yet truly relevant regions, whereas post-hoc methods produce broader and less specific activations, resulting in higher false-positive rates.
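
        One way to write the top-$k$ localization precision described above is the following (an illustrative formalization under a simple reading of the procedure, not necessarily the exact definition of \cite{donteu2023sparse}; the symbols $p_m$ and $\mathcal{G}$ are introduced here). Let $p_1, \dots, p_k$ be the $k$ patches with the highest positive activation and $\mathcal{G}$ the set of annotated image locations:
        \[
            \mathrm{Precision}@k = \frac{1}{k} \sum_{m=1}^{k} \mathbbm{1}\left[\, p_m \cap \mathcal{G} \neq \emptyset \,\right].
        \]
        For example, if $12$ of the top $15$ patches overlap an annotation, the score is $12/15 = 0.8$.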

        Subsequently, we evaluated the faithfulness (also referred to as sensitivity) of the evidence maps generated by our SoftCAM-based explanations in comparison to post-hoc saliency maps. 
        Sensitivity analysis evaluates how much the highly activated regions in an explanation map contribute to the model's prediction \cite{yeh2019fidelity}, thereby assessing whether the highlighted areas actually influence the model’s decision-making process. 
        To do this, we split the input images into non-overlapping $33 \times 33$ patches, then progressively removed the top-ranked patches (based on attribution scores) and measured the relative change in model confidence. 
        We conducted this evaluation on samples that were correctly predicted by both the black-box CNNs and their corresponding SoftCAM variants from the test sets. We found that the sparse SoftCAM generally outperformed other methods, notably on the OCT and RSNA datasets (Fig.\,\ref{fig3:quantitative-analysis}; Appendix \ref{suppl:precision-sensitivity-rsna} and \ref{suppl:prec-sens-eval}). 
        On the fundus dataset, both the base and sparse SoftCAM models performed slightly below the best-performing post-hoc methods, with Guided BP yielding the highest sensitivity scores, followed by Integrated Gradients (Fig.\,\ref{fig3:quantitative-analysis}). 
        On the OCT dataset, the SoftCAM variants outperformed all post-hoc methods with the ResNet model, while with VGG, the sparse and base versions ranked second and third, respectively. 
        On the RSNA dataset, the sparse SoftCAM achieved the highest sensitivity, surpassing all other methods, with the base SoftCAM ranking second with ResNet and third with VGG (Appendix \ref{suppl:prec-sens-eval}).
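
        The relative change in model confidence used in the patch-removal analysis above can be written, for instance, as follows (an illustrative sketch; the notation $\mathbf{X}^{(k)}$ for the image with its $k$ top-ranked patches removed is introduced here, and the formula does not specify how removed patches are filled in):
        \[
            \Delta_k = \frac{\hat{y}^c(\mathbf{X}) - \hat{y}^c(\mathbf{X}^{(k)})}{\hat{y}^c(\mathbf{X})},
        \]
        where $\hat{y}^c(\cdot)$ is the predicted probability of class $c$. For example, a confidence that drops from $0.90$ to $0.45$ after removing $k$ patches corresponds to $\Delta_k = 0.5$.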

    \subsection{Ridge regularization improves explanation for large disease regions}
        % ResNet (l1, l2): (1e3, 7e5)
        % VGG (l1, l2): (8e5, 2e4)  
        \begin{figure}[t]
            \centering
            \includegraphics[width=\textwidth]{figures/fig4_rsna_saliency.pdf}
            %\vspace{-0.5cm}
            \caption{\textbf{Example of localization evaluation on the CXR dataset for pneumonia detection}. The first row shows saliency maps generated by different methods from the ResNet model, and the second row from the VGG model. Ground-truth bounding boxes are overlaid on each map, with the top-right value indicating the activation precision, while the top-left value indicates the activation sensitivity.}
            \label{fig4:activation-prec}
            %\vspace{-0.6cm}
        \end{figure}
    
        Since the CXR dataset provided larger bounding boxes localizing disease regions, unlike the sparse point-wise lesion annotations available in the fundus and OCT datasets, we computed activation precision \cite{barnett2021case}, which measures the proportion of the class-guided explanation that falls within the ground-truth bounding boxes. This metric emphasizes precision by penalizing only false positives; because it does not penalize false negatives, it does not account for sensitivity.
        To address this, we extended this metric by introducing activation sensitivity (Appendix \ref{suppl:activation-prec-sens}), which penalizes false negatives to better assess the completeness of the explanation. This is especially important in clinical imaging tasks, where missing relevant regions can be critical \cite{rsna_dataset}.
        We further investigated how different regularization methods qualitatively and quantitatively affect explanations. While lasso regularization promotes sparsity by shrinking most false positive activations to zero, it can lead to suboptimal interpretability on tasks involving large lesion areas. In contrast, ridge regularization encourages small but nonzero activations, resulting in denser and more informative evidence maps. 
        To evaluate this, we trained a ridge SoftCAM model ($\lambda_1 = 0$) and compared its performance to the sparse SoftCAM, as well as to the post-hoc explanation methods. 
        The ridge regularization strength was selected to preserve predictive performance while maintaining qualitatively meaningful visual explanations ($\lambda_2 = 7 \times 10^{-5}$ and $\lambda_2 = 2 \times 10^{-4}$ for ResNet and VGG, respectively; Appendix \ref{suppl:ridge-penalty}).
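
        To make the two localization metrics used in this subsection concrete, they can be sketched as follows, under assumptions that we state explicitly. Let $M \in \{0,1\}^{H \times W}$ be the binary mask of the ground-truth bounding boxes and let $B \in \{0,1\}^{H \times W}$ be a binarized version of the class-guided explanation at image resolution (the binarization step and the symbols $M$ and $B$ are assumptions of this sketch). A form consistent with the descriptions above is
        \[
            \mathrm{AP} = \frac{\sum_{i,j} B_{ij}\, M_{ij}}{\sum_{i,j} B_{ij}}, \qquad
            \mathrm{AS} = \frac{\sum_{i,j} B_{ij}\, M_{ij}}{\sum_{i,j} M_{ij}} .
        \]
        Here, activation precision (AP) decreases when highlighted pixels fall outside the boxes (false positives), whereas activation sensitivity (AS) decreases when annotated pixels are not highlighted (false negatives). The exact definitions are those given in \cite{barnett2021case} and Appendix \ref{suppl:activation-prec-sens}, which may differ in detail from this illustrative form.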

        Under comparable accuracy (Acc.\ $\approx 0.95$ for the ridge ResNet$^{SC}$ and VGG$^{SC}$), we found that all SoftCAM variants (unregularized, sparse, and ridge) generally outperformed the evaluated post-hoc explanations in both activation precision and activation sensitivity (Fig.\,\ref{fig4:activation-prec}; Appendix \ref{suppl:act-prec-sens-tab} and \ref{suppl:act-prec-sens-visualization}). 
        Specifically, sparse SoftCAM achieved the highest activation precision, while ridge SoftCAM excelled in activation sensitivity. The unregularized SoftCAM consistently performed in between, underscoring the importance of balancing lasso and ridge penalties via ElasticNet to suit diverse datasets, tasks, and interpretability requirements; this balance can be selected empirically. 
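
        As a sketch of this trade-off, and assuming that the penalty acts on the class evidence maps (denoted here by $\mathbf{A}$, a symbol introduced only for illustration) and is added to a generic classification loss $\mathcal{L}_{\mathrm{cls}}$, the ElasticNet objective can be written as
        \[
            \mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \|\mathbf{A}\|_1 + \lambda_2 \|\mathbf{A}\|_2^2 .
        \]
        The ridge variant corresponds to $\lambda_1 = 0$ and, under this reading, the lasso variant to $\lambda_2 = 0$ and the unregularized model to $\lambda_1 = \lambda_2 = 0$. The exact normalization of the penalty terms may differ from this illustrative form.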

    \subsection{SoftCAM provides faithful explanations for multi-class tasks}
        \begin{figure}[t]  
            \centering  
            \includegraphics[width=\textwidth]{figures/fig5_multiclass_sensitivity_analysis.pdf}
            %\includegraphics[width=0.7\textwidth]{figures/Suppl/suppl_muliclass_sensitivity.pdf}
            %\vspace{-0.5cm}
            \caption{\textbf{Sensitivity analysis for the multi-class task.} Evaluation on retinal fundus and OCT datasets assessing how faithfully each explanation method captures the model's internal decision-making process. Lower relative sensitivity indicates more reliable explanations, as it reflects greater changes in the output when important features are removed.}
            \label{fig5:multi-class-sensitivity}
            %\vspace{-0.6cm}
        \end{figure}
        
        Finally, we extended our method to the multi-class setting for retinal disease diagnosis, training the SoftCAM models from ResNet and VGG for DR detection (5 classes, Kaggle dataset) and retinal disease classification (4 classes, OCT dataset). 
        The training setup remained consistent with the binary task, adjusting only the number of classes in the evidence layer and selecting appropriate penalties.
        Given the small size of retinal lesions, we used lasso regularization, selecting $\lambda_1$ values that preserve predictive performance while providing qualitatively good visualizations on a small set of samples (e.g., $\lambda_1 = 9 \times 10^{-4}$ and $\lambda_1 = 3 \times 10^{-6}$ for ResNet and VGG, respectively, on the OCT dataset; Appendix \ref{suppl:multi-class-reg}).
        Both unregularized and sparse models achieved performance comparable to their respective black-box baselines (Tab.\,\ref{tab1:classification}), with a slight improvement in Cohen’s kappa ($\kappa$) on the fundus dataset when using the ResNet backbone. The quadratic kappa accounts for agreement beyond chance.

        As no ground-truth lesion annotations were available for the multi-class tasks, we evaluated the faithfulness of the explanations by measuring their contribution to model predictions. For test samples correctly classified by all models, we progressively removed the top-$k$ ($k=30$) ranked patches (based on the explanation maps; see Sec.\,\ref{sec:faithfulness}) and tracked the relative average drop in class confidence. In both tasks, sparse SoftCAM achieved the best performance (Fig.\,\ref{fig5:multi-class-sensitivity}), yielding the lowest area under the deletion curve and thus the highest faithfulness, except for the VGG backbone on fundus images. Full quantitative results are reported in Appendix \ref{suppl:multi-AUDC}.

        Notably, the sparse SoftCAM produced class-wise explanations that aligned well with the model's class confidence, showing minimal evidence in healthy classes (Fig.\,\ref{fig6:multi-class-visualization}; Appendix \ref{suppl:multi-visualization}, for VGG). In the case of DR detection, a progressive disease, it is expected that images labeled with grade $x$, where $1<x<5$, may still exhibit features from earlier stages, which is consistent with the explanations. 
        Unlike post-hoc CAM-based methods, which require backpropagation or perturbation for each class, SoftCAM generates class-specific explanations along with the prediction in a single forward pass, making it more resource-efficient.
        
        Briefly, both unregularized and sparse models achieved test accuracies comparable to their baselines (Acc.\ $\approx 0.85$ and $0.97$ on the fundus and OCT datasets, respectively), with SoftCAM, particularly the sparse variant, providing the most interpretable class-wise explanations with the best sensitivity.
        Qualitative explanations for the OCT results are provided in Appendix \ref{suppl:multi-oct-visualization}.
        \begin{figure}[t]
            \centering
            \includegraphics[width=\textwidth]{figures/fig6_multi_class_res_fundus_visualization.pdf}
            %\vspace{-0.5cm}
            \caption{\textbf{Examples of multi-class explanations with ResNet.} For a severe diabetic retinopathy case from the Kaggle dataset, the rows show class-specific explanations generated by SoftCAM variants. Integrated Gradients is the best-performing post-hoc method in terms of sensitivity, ranking behind the unregularized and sparse SoftCAM variants. The color scale of the explanation maps ranges from green (healthy evidence) to red (disease evidence).}
            \label{fig6:multi-class-visualization}
            %\vspace{-0.6cm}
        \end{figure}
        
\section{Discussion}
    \label{discussion}    
    Here, we introduced SoftCAM, a lightweight architectural modification that makes standard black-box CNNs inherently interpretable without relying on post-hoc explanations. 
    SoftCAM replaces the final pooling and fully connected layer with $1 \times 1$ convolutions, producing class-specific evidence maps that directly inform predictions. This design also supports ElasticNet regularization on the evidence maps to improve explanations.
    As a result, SoftCAM produces aligned predictions and explanations in a single forward pass, yielding resource-efficient, self-explainable CNNs.     
    We validated the method on two widely used CNN backbones, ResNet-50 and VGG-16, across three medical imaging modalities (fundus photographs, retinal OCT scans, and chest X-rays), assessing both classification performance and explainability, and found that the resulting explainable models achieve classification performance comparable to their black-box baselines.       
    For explainability, SoftCAM was benchmarked against six state-of-the-art post-hoc saliency methods (CAM, ScoreCAM, LayerCAM, GradCAM, Guided BP, and Integrated Gradients) using qualitative and quantitative metrics that capture both alignment with human knowledge and alignment with the model's true decision-making, demonstrating superior interpretability without compromising predictive performance. 
    In particular, localization precision and the introduced activation sensitivity assess how well explanations match clinically relevant biomarkers using expert annotations, while faithfulness measures how highlighted regions truly influence the model's predictions. 
    Interestingly, our results revealed a discrepancy between human-aligned and model-aligned metrics (Fig.\,\ref{fig3:quantitative-analysis}): explanations with the highest faithfulness did not always correspond to the strongest alignment with expert annotations. This might suggest two possible explanations: first, the models may rely on clinically relevant signals that are hidden to humans, highlighting the need for evaluation metrics that jointly account for human domain knowledge and the models' internal reasoning; alternatively, occlusion may introduce out-of-distribution inputs that potentially affect sensitivity analysis results, thereby undermining the reliability of the explanations \cite{hase2021out}.

    Despite its promising results, SoftCAM has some limitations that warrant further investigation. 
    First, SoftCAM relies on the final low-resolution feature maps (e.g., $16 \times 16$ for standard CNNs), limiting the spatial granularity needed for fine-grained medical tasks that require lesion- or pixel-level precision. In our case, both ResNet and VGG models use large receptive fields, producing low-resolution feature maps. Consequently, the class-evidence layer generates coarse-grained explanations, though ElasticNet constraints help improve localization. 
    Future work could focus on generating higher-resolution explanations with SoftCAM to overcome coarseness from downsampling in deep CNNs; integrating SoftCAM with other architectures like ViTs \cite{dosovitskiy2020image}; extending it to tasks such as weakly supervised segmentation or object detection; and evaluating its utility in other high-stakes domains, including agriculture or medical modalities such as dermoscopy for skin cancer detection or MRI for brain tumor detection. 
    Finally, SoftCAM was compared only against a subset of relevant post-hoc methods, consistent with our primary goal of avoiding post-hoc explanations for CNN classifiers. 
    Future work could extend this evaluation to include self-explainable models, such as part-prototype networks \cite{chen2019looks}, concept-based models \cite{koh2020concept}, attention-based explainable architectures \cite{djoumessi2025hybrid}, and intrinsically interpretable architectures \cite{bohle2022b}. 

    \clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
    % Acknowledgments---Will not appear in anonymized version
    \midlacknowledgments{This project was supported by the Hertie Foundation, the German Science Foundation (Excellence Cluster EXC 2064 ``Machine Learning—New Perspectives for Science'', project number 390727645; BE 5601/14-1, project number 571331899). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Kerol Djoumessi.}
    
    \bibliography{midl26_39}
    
    \newpage
    \appendix
    \input{midl26_39_suppl}
    
    \par\vspace{-\baselineskip}
\end{document}
