\newpage
%%%%%%%%%%%%%%
\appendix
% \addcontentsline{toc}{section}{Appendix} % Add the appendix text to the document TOC
% \part{Appendix} % Start the appendix part
% \parttoc % Insert the appendix TOC

This supplementary material is organized as follows:
\begin{enumerate}
    \item Related work
    \item Dataset details
    \item Training details for standard WSOL and OOD scenarios
    \item Class separability index
    \item More localization performance analysis
    %\update{\item {Computational complexity}}
    \item Additional results with different backbones
    \item Computational complexity
    \item Additional results for classification (\cl) and localization (\pxap) accuracy on \camsixteen test set with different stainings
    \item Limitations and future work
    %\update{\item {Limitations and future work}}
    \item Ablations:
    \begin{enumerate}
        \item Impact of \pixelcam depth
        \item Impact of model selection of CAM pre-trained model
        \item Impact of pixel sampling technique
        \item Impact of the number of sampled pixels $n$
        \item Impact of $\lambda$
        \item Impact of using a pretrained WSOL model on same and external dataset for pseudo-labeling
        %\update{\item Impact of using a pretrained WSOL model on same and external dataset for pseudo-labeling}
    \end{enumerate}

    \item More visualization results:
    \begin{enumerate}
        \item Standard WSOL setup
        \item Over Target set with OOD setup
    \end{enumerate}
\end{enumerate}


% ===============================================================================================
%                              RELATED WORK
% ===============================================================================================


%%%%%%%%%%
\section{Related Work}
\label{sec:related}

WSOL has emerged as a low-cost training setup to localize objects while classifying image content~\cite{choe2020evaluating,rpwsol,zhou2017brief}. WSOL has also been extended to videos~\cite{belharbi23tcam,belharbi25colocam}. Moreover, there has been a recent focus on designing WSOL methods for histology image analysis~\cite{rony23}. This reduces the large annotation cost and yields visually interpretable classifiers.

\noindent \textbf{Single-step WSOL}. Several approaches address the WSOL problem in a single step~\cite{choe2020evaluating,rpwsol,rony23}, where a single model is trained to perform both classification and localization. Usually, this is achieved by using a localization head followed by a spatial pooling layer to extract per-class probabilities. Early~\cite{deepmil,gradcam,zhou2016learning} but also recent WSOL works~\cite{SAT,zhu24,tscam} follow this strategy. Although this has brought significant improvements to the field, these models still face several limitations. Since only image-class labels are used as supervision without any localization cues, CAM localization can perform poorly, more so when dealing with less salient objects such as in histology images. Recent work~\cite{rony23} showed that without localization cues over histology images, these models often lead to highly unbalanced localization. This manifests as either under- or over-activation, leading to high false negative or false positive rates. In addition, this approach faces a great challenge in model selection, as the two tasks converge asynchronously~\cite{choe2020evaluating,rpwsol,rony23}. For instance, the model selected for best localization often yields poor classification.

\noindent \textbf{Two-step WSOL}. This strategy has emerged as a new line of research. In particular, a dual-model approach is combined with localization cues in the form of pseudo-labels~\cite{rpwsol,rony23}. This provides direct guidance for the localization task, bypassing the issue of asynchronous task convergence. Several works simply create a dedicated model per task: one for classification and another for localization~\cite{wei2021shallowspol,zhang2020rethinking,zhao2023generative}. This leads to better performance since the tasks are divided between two large models. However, this approach is cumbersome, as it drastically increases the number of parameters and training cycles. Most importantly, the localization CAMs are completely disconnected from the classification decision, making this strategy unreliable and less interpretable. A parallel direction overcomes this issue by using a decoder as a localizer on top of the classifier~\cite{negev,fcam,Murtaza2023dips}, similarly to a U-Net architecture~\cite{unet}. This creates a direct relation between both tasks. However, the decoder has a limited learning capacity since it is tied to a frozen encoder at several layers via skip connections. This prevents the localizer from adapting well and limits its performance, leading to a sub-optimal solution. In addition, it adds a significant number of parameters to the classifier.

\noindent \textbf{WSOL vs OOD}. In addition to these limitations, WSOL methods face challenges when dealing with domain shift which degrades the performance of both tasks~\cite{sfdawsol}. Such shift is common in histology due to variations in stains, objects' structure, microscope type, and imaging centers. Further analysis shows that this issue could be rooted in the pixel features of the image encoder. Both classification and localization tasks depend heavily on these spatial features. Less flexible and poorly discriminant pixel features can lead to poor performance in both tasks since they both rely on these embeddings.

In summary, while existing WSOL methods have achieved great progress, they still face several limitations. Single-step WSOL methods do not leverage localization cues, making them vulnerable to wrong localization when dealing with complex and less salient data such as histology images. This adds to the well-known issue of asynchronous convergence of the localization and classification tasks. On the other hand, two-step methods are cumbersome in terms of training cycles and number of parameters. In addition, either both tasks are disconnected, leading to misaligned decisions, or the localizer has a limited capacity due to the frozen backbone.
%
Our method comes as a simple yet efficient alternative. It performs WSOL in a single step within a multi-task framework where both tasks are optimized simultaneously. This leads to a single training cycle and eases the convergence issue. It allows sharing parameters between both tasks, making it parameter-efficient. It can also leverage localization cues such as pseudo-labels. Finally, learning well-separated features at the pixel level makes our model more robust to OOD data.


% ===============================================================================================
%                              DATASET DETAILS
% ===============================================================================================

\section{Datasets}
\label{sec:datasets-details}

\paragraph{GlaS.}  
The \glas dataset is used for the diagnosis of colon cancer. It consists of 165 images from 16 Hematoxylin and Eosin (H\&E) stained histology sections and includes labels at both the pixel level and the image level (benign or malignant). The dataset is split into 67 images for training, 18 for validation, and 80 for testing. We use the same protocol as in~\citep{rony23, sfdawsol, negev}. Similarly, we use 3 fully supervised samples per class from the validation set for model selection for localization (\beloc).

\paragraph{CAMELYON16.} 
A patch-based benchmark~\cite{rony23} is extracted from the \camsixteen dataset, which contains 399 whole-slide images (WSIs) categorized into two classes (normal and metastatic) for the detection of breast cancer metastases in H\&E-stained tissue sections of sentinel lymph nodes. Patches of size ${512\times 512}$ are extracted following the protocol established by~\citep{rony23, sfdawsol, negev} to obtain patches with annotations at the image and pixel levels.
The dataset contains a total of 48870 images, including 24348 for training, 8858 for validation, and 15664 for testing. From the validation set, 6 examples per class are randomly selected and fully supervised to perform model selection for localization (\beloc), similarly to~\citep{rony23, sfdawsol, negev}.

\section{WSOL Training Details}
\label{sec:training-details}


To pretrain a WSOL baseline method, we use the same setup as in~\cite{sfdawsol}.
In the first phase, the model is initialized with a backbone pretrained on ImageNet~\cite{ImageNet} and trained using SGD with a batch size of 32~\cite{sfdawsol}. Training runs for 1000 epochs on \glas and 20 epochs on \camsixteen, with a weight decay of $10^{-4}$. During training, images are resized to ${256\times 256}$, then randomly cropped to ${224\times 224}$. Following~\cite{sfdawsol}, a hyperparameter search was conducted over the learning rate in \{0.0001, 0.001, 0.01\} and its decay factor in \{0.1, 0.4, 0.9\}. In the second phase, we train \pixelcam using the CAMs generated by the pretrained baseline, with the same setup as in the first phase.
For the OOD scenario, the source model is trained using the same setup on a source dataset and evaluated on a target dataset.




% ===============================================================================================
%                              CLASS SEPARABILITY INDEX
% ===============================================================================================


\section{Class Separability Index}
\label{sec:class_separability}

In this section, we present the class separability index between foreground (FG) and background (BG) classes at the pixel-feature level over the test sets of both datasets, \glas and \camsixteen. To measure how well the FG/BG classes are separated in the feature space of pixels, we resort to the class separability index~\cite{Duda2000separability}.
The class separability, $J$, is based on the within-class scatter matrix ($\bm{S}_{W}$) and the between-class scatter matrix ($\bm{S}_{B}$) of pixel-features, defined as follows,

\begin{equation}
\bm{S}_W = \sum_{i=1}^{c} \left[ \sum_{j=1}^{n_i} (\bm{x}_{i,j} - \bm{m}_i)(\bm{x}_{i,j} - \bm{m}_i)^\top \right]\, ,
\end{equation}

\begin{equation}
\bm{S}_B = \sum_{i=1}^{c} n_i (\bm{m}_i - \bm{m})(\bm{m}_i - \bm{m})^\top \, ,
\end{equation}

\noindent where $c$ is the number of classes (in our case ${c=2}$ for foreground and background). Note that this index $J$ is computed over a single image using the pixel-labels.
Let $n_{i}\; (i=1,\dots,c)$ be the number of pixels in the $i$-th class. The feature vector $\bm{x}_{i,j} \in \mathbb{R}^{d}$ denotes the $j$-th pixel of the $i$-th class. It represents the vector ${\mathsf{F}_p}$ for the pixel $p$ in the main paper. The mean vector of the $i$-th class is given by
$\bm{m}_{i}$, while $\bm{m}$ represents the mean vector computed over all feature vectors. The class separability, denoted $J$, is defined as follows~\cite{Duda2000separability},


\begin{equation}
J = \frac{\tr(\bm{S}_B)}{\tr(\bm{S}_W)}\,.
\end{equation}
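
\noindent As a reading aid that uses only the definitions above (no additional assumption), recall that the trace of an outer product $\bm{v}\bm{v}^\top$ equals $\|\bm{v}\|_2^2$. Both traces therefore reduce to sums of squared Euclidean distances, and neither scatter matrix needs to be formed explicitly:
\begin{equation}
\tr(\bm{S}_W) = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \|\bm{x}_{i,j} - \bm{m}_i\|_2^2 \, , \qquad \tr(\bm{S}_B) = \sum_{i=1}^{c} n_i \|\bm{m}_i - \bm{m}\|_2^2 \, .
\end{equation}
In the two-class case ${c=2}$, substituting $\bm{m} = (n_1 \bm{m}_1 + n_2 \bm{m}_2)/(n_1 + n_2)$ gives
\begin{equation}
\tr(\bm{S}_B) = \frac{n_1 n_2}{n_1 + n_2} \, \|\bm{m}_1 - \bm{m}_2\|_2^2 \, ,
\end{equation}
so $J$ grows with the distance between the FG and BG mean features and shrinks with the spread of the pixel features around their own class mean.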

In Tables~\ref{tab:accuracy-class-separability-bloc-source} and \ref{tab:accuracy-class-separability-bloc-target}, we provide the average class separability over the normal and cancer classes, as well as over the entire dataset. As we can observe, \pixelcam improves the class separability between FG and BG of WSOL baseline methods on both datasets, except for GradCAM++ on \camsixteen. This exception is explained by the high number of images with low separability, as illustrated in Fig.~\ref{fig:histogram_camelyon_separability_source}. The improvement of \pixelcam in the OOD scenario compared to WSOL baseline methods such as DeepMIL and LayerCAM on \glas and \camsixteen can be attributed to better separability, as shown in Figs.~\ref{fig:all_histogram_target_glas} and \ref{fig:all_histogram_target_camelyon}. We provide examples of feature separability at the pixel level in Section~\ref{sec:appendix_visualization}.


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!htbp]
\vspace{-4pt}
\centering 
\resizebox{0.999\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c c c |}
\hline
& & \multicolumn{3}{c|}{\glas} & \multicolumn{3}{c|}{\camsixteen} \\
%\cline{3-5}\cline{6-8}
 & \textbf{WSOL models} & Normal & Cancer & All & Normal & Cancer & All \\
\hline \hline

& DeepMIL~\cite{deepmil} ICML & 0.13 & 0.07 & 0.10 & - & 0.11 & 0.11 \\
& DeepMIL w/ \pixelcam & \textbf{0.21} & \textbf{0.09} & \textbf{0.14} & - & \textbf{0.17} & \textbf{0.17} \\
\hline
& GradCAM{\textit{++}}~\cite{gradcampp} WACV & 0.22 & 0.13 & 0.17 & - & \textbf{0.22} & \textbf{0.22} \\
& GradCAM{\textit{++}} w/ \pixelcam & \textbf{0.25} & \textbf{0.15} & \textbf{0.20} & - & 0.16 & 0.16 \\
\hline
& LayerCAM~\cite{layercam} IEEE TIP & \textbf{0.17}& 0.04 & 0.10 & - & 0.07 & 0.07 \\
& LayerCAM w/ \pixelcam & 0.14 & \textbf{0.12} & \textbf{0.13} & - & \textbf{0.17} & \textbf{0.17} \\
\hline
& SAT~\cite{SAT} ICCV & 0.21 & 0.06 & 0.13 & - & 0.17 & 0.17 \\
& SAT w/ \pixelcam & \textbf{0.25} & \textbf{0.10} & \textbf{0.17} & - & \textbf{0.22} & \textbf{0.22} \\
\hline
\end{tabular}
}
\caption{Standard WSOL setup: Average class separability index $J$ between FG/BG pixel-features on the \glas and \camsixteen test sets for WSOL baselines with and without \pixelcam. The higher $J$ is, the better the classes are separated.}
\label{tab:accuracy-class-separability-bloc-source}
\vspace{-1em}
\end{table}

\begin{figure}[!htbp]
\centering
\includegraphics[width=0.999\linewidth]{visualization-CAMELYON512-all_histograms_source_camelyon.png}
\caption{Standard WSOL setup: Histogram of class separability index $J$ between FG/BG pixel-features on the \camsixteen test set for WSOL baselines with and without \pixelcam. The higher $J$ is, the better the classes are separated.}
\label{fig:histogram_camelyon_separability_source}
\end{figure}


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[htbp]
\vspace{-4pt}
\centering 
\resizebox{0.999\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c c c |}
\hline
& & \multicolumn{3}{c|}{\camsixteen $\rightarrow$ \glas} & \multicolumn{3}{c|}{\glas $\rightarrow$ \camsixteen} \\
%\cline{3-5}\cline{6-8}
 &\textbf{WSOL models} & Normal & Cancer & All & Normal & Cancer & All \\
\hline \hline

& DeepMIL~\cite{deepmil} ICML & 0.05 & 0.11 & 0.08 & - & \textbf{0.05} & \textbf{0.05} \\
& DeepMIL w/ \pixelcam & \textbf{0.06} & \textbf{0.13} & \textbf{0.10} & - & \textbf{0.05} & \textbf{0.05} \\
\hline
& GradCAM{\textit{++}}~\cite{gradcampp} WACV & \textbf{0.08} & \textbf{0.19} & \textbf{0.14} & - & \textbf{0.08} & \textbf{0.08} \\
& GradCAM{\textit{++}} w/ \pixelcam & 0.05 & 0.09 & 0.07 & - & 0.07 & 0.07 \\
\hline
& LayerCAM~\cite{layercam} IEEE TIP & 0.02 & \textbf{0.13} & 0.08 & - & 0.04 & 0.04 \\
& LayerCAM w/ \pixelcam & \textbf{0.06} & 0.11 & \textbf{0.09} & - & \textbf{0.09} & \textbf{0.09} \\
\hline
& SAT~\cite{SAT} ICCV & 0.11 & 0.09 & 0.10 & - & 0.08 & 0.08 \\
& SAT w/ \pixelcam & \textbf{0.12} & \textbf{0.12} & \textbf{0.12} & - & \textbf{0.09} & \textbf{0.09} \\
\hline
\end{tabular}
}
\caption{OOD setup: Average class separability index $J$ between FG/BG pixel-features on the target test sets \glas and \camsixteen for both OOD cases: \mbox{\camsixteen $\rightarrow$ \glas} and \mbox{\glas $\rightarrow$ \camsixteen} for WSOL baselines with and without \pixelcam. The higher $J$ is, the better the classes are separated.}
\label{tab:accuracy-class-separability-bloc-target}
\vspace{-1em}
\end{table}



\begin{figure}[!htbp]
\centering
\includegraphics[width=0.999\linewidth]{visualization-GLAS-all_histograms_target_glas.png}
\caption{OOD setup: Histogram of class separability index $J$ between FG/BG pixel-features on the target test set \glas for the OOD case: \mbox{\camsixteen $\rightarrow$ \glas} for WSOL baselines with and without \pixelcam. The higher $J$ is, the better the classes are separated.}
  \label{fig:all_histogram_target_glas}
\end{figure}
\begin{figure}[!htbp]
\centering
\includegraphics[width=0.999\linewidth]{visualization-CAMELYON512-all_histograms_target_camelyon.png}
\caption{OOD setup: Histogram of class separability index $J$ between FG/BG pixel-features on the target test set \camsixteen for the OOD case: \mbox{\glas $\rightarrow$ \camsixteen} for WSOL baselines with and without \pixelcam. The higher $J$ is, the better the classes are separated.}
  \label{fig:all_histogram_target_camelyon}
\end{figure}


% ===============================================================================================
%                              MORE LOCALIZATION ANALYSIS
% ===============================================================================================
% \newpage
\clearpage
\section{More Localization Performance Analysis}

\pxap is the primary metric used to measure localization performance. However, following~\cite{rony23}, we also report other pixel-wise performance measures, namely the true/false positive/negative rates.
As shown previously, our method \pixelcam achieves better results in terms of \pxap. This improvement is reflected in Tab.~\ref{tab:mtx-conf-best-loc-glas-camelyon} by higher true positive/negative rates compared to GradCAM++, LayerCAM, SAT, and NEGEV on the \glas dataset.
For the \camsixteen dataset, we observe that \pixelcam increases the true positive rate compared to standard methods while being competitive with NEGEV. However, we note a significant improvement when applying \pixelcam to the SAT method, with a considerable increase in true positives, thereby reducing the over-activation issue.
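
\noindent For reference, a minimal sketch of these rates, assuming the standard definitions (the exact thresholding protocol of~\cite{rony23} is not restated here). Writing $\mathrm{TP}$, $\mathrm{FN}$, $\mathrm{TN}$, and $\mathrm{FP}$ for the pixel counts obtained after thresholding a CAM (these count symbols are introduced here for illustration only), the true positive and true negative rates are
\begin{equation}
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \, , \qquad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \, ,
\end{equation}
with $\mathrm{FNR} = 1 - \mathrm{TPR}$ and $\mathrm{FPR} = 1 - \mathrm{TNR}$. Each pair of complementary rates therefore sums to 100\% up to rounding, e.g., ${63.4 + 36.6 = 100}$ for DeepMIL on \glas.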

{
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.1}
\begin{table}[ht!]
\centering
\resizebox{0.99\textwidth}{!}{%
\centering
\small
\begin{tabular}{l l c*{4}{c}c*{4}{c}{c}}
& &  & \multicolumn{4}{c}{\glas} & & \multicolumn{4}{c}{\camsixteen} & \\
\cline{1-2}\cline{4-7}\cline{9-12} \\
\textbf{Bottom-up WSOL} &  &  & \pxtp \(\uparrow\)& \pxfn \(\downarrow\)& \pxtn \(\uparrow\)& \pxfp  \(\downarrow\)&  & \pxtp \(\uparrow\)& \pxfn \(\downarrow\)& \pxtn \(\uparrow\)& \pxfp \(\downarrow\)  \\
\cline{1-2}\cline{4-7}\cline{9-12}\\
DeepMIL~\citep{deepmil} ICML $\dagger$ &  &  & 63.4  & 36.6 & \underline{76.3} & \underline{23.6} &  & \textbf{66.9} & \textbf{33.1}& 89.6 &10.4&  \\
DeepMIL w/ NEGEV~\citep{negev} MIDL $\star$ &  &  & \textbf{79.0}  & \textbf{21.0} & 75.5 & 24.5 &  & \underline{64.0} & \underline{36.0}& \underline{92.6}  &\underline{7.4}&  \\
DeepMIL w/ \pixelcam $\dagger$ &  &  & \underline{76.4} & \underline{23.6}  & \textbf{77.4} & \textbf{22.6} &  &59.0& 41.0& \textbf{93.3}&\textbf{6.7}&  \\
\cline{1-2}\cline{4-7}\cline{9-12}\\
GradCAM++~\citep{gradcampp} WACV $\dagger$ &  &  & \underline{62.0}  & \underline{38.0} & \textbf{79.9} &\textbf{20.1} &  & 42.1&57.8&  89.4 &10.6&  \\
GradCAM++ w/ NEGEV~\citep{negev} MIDL $\star$ &  &  & 58.4  & 41.6 & 76.0 &24.0&  & \textbf{53.5}& \textbf{46.5}& \textbf{92.2}  & \textbf{7.8}&  \\
GradCAM++ w/ \pixelcam $\dagger$ &  &  & \textbf{76.9}  & \textbf{23.1} & \underline{78.8} & \underline{21.2} &  & \underline{49.2} & \underline{50.8} &  \underline{91.4} & \underline{8.6}&  \\
\cline{1-2}\cline{4-7}\cline{9-12}\\
LayerCAM~\citep{layercam} IEEE TIP $\dagger$ &  &  & 62.3  & 37.7 & \textbf{72.6} &  \textbf{27.4} &  &29.6& 70.3& 86.8 &13.2&  \\
LayerCAM w/ NEGEV~\citep{negev} MIDL $\star$ &  &  & \underline{66.3}  & \underline{33.7} & 70.7 & 29.3 &  & \underline{54.4} & \underline{45.6}& \textbf{90.9}  & \textbf{9.1}&  \\
LayerCAM w/ \pixelcam $\dagger$ &  &  & \textbf{77.6}  & \textbf{22.4} & \underline{72.5} & \underline{27.5} &  & \textbf{56.5}& \textbf{43.5}& \underline{89.2}  &\underline{10.8}&  \\
\cline{1-2}\cline{4-7}\cline{9-12}\\
\cline{1-2}\cline{4-7}\cline{9-12}\\
U-Net~\citep{unet} MICCAI &  &  & 88.9  & 11.1 & 89.8  & 10.2  &  & 68.0& 32.0 & 94.5 &5.5&  \\
\cline{1-2}\cline{4-7}\cline{9-12}\\
\end{tabular}
}
\caption{Confusion matrix performance over the \glas and \camsixteen test sets with the standard WSOL setup. $\dagger$ denotes single-step models while $\star$ denotes two-step models.}
\label{tab:mtx-conf-best-loc-glas-camelyon}
\vspace{-1em}
\end{table}


}

\newpage
\section{Additional Results With Different Backbones}
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[ht!]
\vspace{-4pt}
\centering
\resizebox{.9\textwidth}{!}{
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap  \(\uparrow\)& \cl \(\uparrow\) &  & \pxap \(\uparrow\) & \cl \(\uparrow\)& \\
\hline \hline
&  DeepMIL~\cite{deepmil} ICML $\dagger$   &  & 73.3 &\textbf{100.0}& &60.3 & \textbf{88.0}& \\
&  DeepMIL w/ NEGEV~\cite{negev} MIDL $\star$ &  &\underline{76.7} & \textbf{100.0}& &\underline{62.2}&\underline{86.6}& \\
&  DeepMIL w/ \pixelcam  $\dagger$ &  & \textbf{78.6} &\textbf{100.0}& &\textbf{69.9}& 85.4& \\
\hline
&  GradCAM{\textit{++}}~\cite{gradcampp} WACV $\dagger$ &  & 72.9 &\textbf{100.0}& &27.7 & \underline{87.4}& \\
&  GradCAM{\textit{++}} w/ NEGEV~\cite{negev} MIDL $\star$ &  &\textbf{80.4}& \textbf{100.0}& &\textbf{68.3}& 87.3& \\
& GradCAM{\textit{++}} w/ \pixelcam  $\dagger$ & & \underline{76.9} &\textbf{100.0}& &\underline{64.9} & \textbf{87.7}& \\
\hline
&  LayerCAM~\cite{layercam} IEEE TIP $\dagger$ &  & 73.0&\textbf{100.0}& &24.5& \textbf{88.7}& \\
&  LayerCAM w/ NEGEV~\cite{negev} MIDL $\star$ &  &\textbf{80.7}& \textbf{100.0}& &\textbf{68.7}&\underline{88.1}& \\
&  LayerCAM w/ \pixelcam $\dagger$ & & \underline{76.3}&\textbf{100.0}& &\underline{64.2}&87.1& \\
\hline
\end{tabular}
}
\caption{Localization (\pxap) and classification (\cl) accuracy on the \glas and \camsixteen test sets using different WSOL methods with a VGG16 architecture. $\dagger$ denotes single-step models while $\star$ denotes two-step models.}
\label{tab:accuracy-bloc-source-vgg16}
\vspace{-2.5em}
\end{table}


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[ht!]
\vspace{-4pt}
\centering
\resizebox{.9\textwidth}{!}{
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap  \(\uparrow\)& \cl \(\uparrow\) &  & \pxap \(\uparrow\) & \cl \(\uparrow\)& \\
\hline \hline
&  DeepMIL~\cite{deepmil} ICML $\dagger$   &  & 62.6 &87.5& &58.1 & \underline{89.0}& \\
&  DeepMIL w/ NEGEV~\cite{negev} MIDL $\star$ &  &\underline{66.1} & \textbf{91.2}& &\textbf{76.8}&\textbf{89.5}& \\
&  DeepMIL w/ \pixelcam  $\dagger$ &  & \textbf{72.0} &\underline{88.8}& &\underline{75.7}& 83.2& \\
\hline
&  GradCAM{\textit{++}}~\cite{gradcampp} WACV $\dagger$ &  & 70.0 &53.8& &64.0 & \underline{88.4}& \\
&  GradCAM{\textit{++}} w/ NEGEV~\cite{negev} MIDL $\star$ &  &\textbf{81.7}& \underline{93.7}& &\underline{74.1}& 86.9& \\
& GradCAM{\textit{++}} w/ \pixelcam  $\dagger$ & & \underline{81.0} &\textbf{96.3}& &\textbf{75.0} & \textbf{89.8}& \\
\hline
&  LayerCAM~\cite{layercam} IEEE TIP $\dagger$ &  & 68.3&92.5& &53.7& \textbf{83.8}& \\
&  LayerCAM w/ NEGEV~\cite{negev} MIDL $\star$ &  &\underline{77.8}& \underline{93.8}& &\textbf{71.6}&\textbf{83.8}& \\
&  LayerCAM w/ \pixelcam $\dagger$ & & \textbf{78.3}&\textbf{100.0}& &\underline{65.5}&\underline{82.5}& \\
\hline
\end{tabular}
}
\caption{Localization (\pxap) and classification (\cl) accuracy on the \glas and \camsixteen test sets using different WSOL methods with an InceptionV3 architecture. $\dagger$ denotes single-step models while $\star$ denotes two-step models.}
\label{tab:accuracy-bloc-source-inceptionv3}
\vspace{-2.5em}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%T-STAT%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[ht!]
\centering
\resizebox{0.93\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c | c c |}
\hline
& & \multicolumn{2}{c|}{\glas} & \multicolumn{2}{c|}{\camsixteen} \\
& \textbf{WSOL models} & \textbf{T-stat} $\uparrow$ & \textbf{p-value} $\downarrow$ & \textbf{T-stat} $\uparrow$ & \textbf{p-value} $\downarrow$ \\
\hline \hline
& DeepMIL~\cite{deepmil} ICML $\dagger$ & 16.8 & $3.4 \times 10^{-16}$ & 9.3 & $5.1 \times 10^{-10}$ \\
& GradCAM{\textit{++}}~\cite{gradcampp} WACV $\dagger$ & 11.2 & $7.9 \times 10^{-12}$ & 139.1 & $2.1\times 10^{-41}$ \\
& LayerCAM~\cite{layercam} IEEE TIP $\dagger$ & 6.6 & $3.9 \times 10^{-7}$ & 180.1 & $1.9 \times 10^{-44}$ \\
\hline
\end{tabular}
}
\caption{T-test statistics between baselines (DeepMIL, GradCAM{\textit{++}}, LayerCAM) and our method (\pixelcam) for localization performance with VGG16 backbone.}
\label{tab:tstat-results-vgg16}
\vspace{-2.5em}
\end{table}


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[ht!]
\centering
\resizebox{0.93\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c | c c |}
\hline
& & \multicolumn{2}{c|}{\glas} & \multicolumn{2}{c|}{\camsixteen} \\
& \textbf{WSOL models} & \textbf{T-stat} $\uparrow$ & \textbf{p-value} $\downarrow$ & \textbf{T-stat} $\uparrow$ & \textbf{p-value} $\downarrow$ \\
\hline \hline
& DeepMIL~\cite{deepmil} ICML $\dagger$ & 17.8 & $8.2 \times 10^{-17}$ & 18.0 & $6.5 \times 10^{-17}$ \\
& GradCAM{\textit{++}}~\cite{gradcampp} WACV $\dagger$ & 19.3 & $1.0 \times 10^{-17}$ & 10.9 & $1.5\times 10^{-11}$ \\
& LayerCAM~\cite{layercam} IEEE TIP $\dagger$ & 13.7 & $5.8 \times 10^{-14}$ & 11.0 & $1.1 \times 10^{-11}$ \\
\hline
\end{tabular}
}
\caption{T-test statistics between baselines (DeepMIL, GradCAM{\textit{++}}, LayerCAM) and our method (\pixelcam) for localization performance with InceptionV3 backbone.}
\label{tab:tstat-results-inceptionv3}
\vspace{-2.5em}
\end{table}




\section{Computational Complexity}
The computational overhead of the pixel classifier in \pixelcam is minimal, which makes it suitable for analyzing whole slide images (WSIs) that contain millions of pixels~\citep{rony23}. From a memory point of view, the impact is negligible since the number of additional parameters depends essentially on the feature size. For instance, in the ResNet50-based architecture, the feature size is equal to 2048, and the number of classes is equal to two (FG and BG). The pixel classifier adds only 4,098 parameters to the CNN-based architecture, which already contains more than 23.5M parameters. The same conclusion applies to the transformer-based architecture ($>$ 5.5M parameters), where we add only 386 parameters.
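
\noindent As a sanity check of these counts, assume that the pixel classifier is a single linear layer with a weight matrix and a bias vector, mapping $d$-dimensional pixel features to $c$ classes (this layer form is an assumption that is consistent with the numbers above). The number of added parameters is then
\begin{equation}
d \times c + c = 2048 \times 2 + 2 = 4098
\end{equation}
for ResNet50, i.e., about ${4098 / (23.5 \times 10^{6}) \approx 0.017\%}$ of the backbone parameters. Assuming an embedding size of ${d=192}$ for the transformer backbone, the same formula gives ${192 \times 2 + 2 = 386}$.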

In terms of inference cost, our pixel classifier is particularly advantageous compared to gradient-based WSOL models (see Tab.~\ref{tab:inference-time-wsol}). Our model therefore incurs negligible computational overhead, allowing for fast training and inference. This time efficiency, combined with the robustness of \pixelcam, is a key advantage for deployment in practical scenarios where WSIs can be extremely large.

\begin{table}[!htbp]
    \centering
    %{\color{red}
    \begin{tabular}{|c|c|c|}
        \hline
       \textbf{WSOL Models} &  \textbf{Inference time} \(\downarrow\) & \textbf{No. parameters}\\
        \hline \hline
       DeepMIL & \textbf{9.1}ms & 24,036,932\\
       DeepMIL w/ NEGEV & 12.3ms & 33,050,150\\
       DeepMIL w/ PixelCAM & 9.2ms & 24,041,030\\
        \hline
        GradCAM++ & 40.9ms & 23,514,179\\
        GradCAM++ w/ NEGEV & 10.3ms & 32,527,397\\
        GradCAM++ w/ PixelCAM & \textbf{8.8}ms &  23,518,277\\
        \hline
        LayerCAM&  37.9ms & 23,514,179\\
        LayerCAM w/ NEGEV & 10.4ms & 32,527,397\\
        LayerCAM w/ PixelCAM & \textbf{8.4}ms & 23,518,277\\
        \hline
        SAT &  \textbf{10.5}ms & 5,528,267\\
        SAT w/ PixelCAM &  11.5ms &5,528,651\\
        \hline
        U-Net  &  \textbf{10.1}ms & 32,521,250\\
        \hline
    \end{tabular}
    %}
    \caption{Inference time required to produce CAMs using different WSOL methods with a ResNet50 architecture for CNN-based models and DeiT-Tiny for the Transformer-based model. The time needed to build a full-size CAM is estimated using an NVIDIA RTX A6000 GPU for one random RGB image of size ${224\times 224}$.}
   \label{tab:inference-time-wsol}
\end{table}


\section{Additional Results for Classification (\cl) and Localization (\pxap) Accuracy on \camsixteen Test Set with Different Stainings}
\label{sec:staining}

In this section, we extend the experiment initially conducted on the \glas dataset to the \camsixteen dataset. We modify the stainings in the \camsixteen test set to evaluate the robustness of \pixelcam compared to the baseline method (LayerCAM). \pixelcam consistently outperforms the baseline in terms of robustness on both classification and localization tasks.


\begin{figure}[!ht]
  \includegraphics[width=\linewidth]{visualization-diff_stainings-classification_comparison_camelyon.png}
  \caption{Classification (\cl) accuracy on the \camsixteen test set with LayerCAM and \pixelcam under different stainings.
  }
  \label{fig:visual-example-glas-cl}
\end{figure}

\begin{figure}[!ht]
  \includegraphics[width=\linewidth]{visualization-diff_stainings-localization_comparison_camelyon.png}
  \caption{Localization (\pxap) accuracy on the \camsixteen test set with LayerCAM and \pixelcam under different stainings.
  }
  \label{fig:visual-example-glas-loc}
\end{figure}


\section{Limitations and Future Work}

WSOL methods are known to underperform when applied to new datasets due to domain shift~\citep{sfdawsol}. Although our model is impacted by this issue, it demonstrates improved performance in terms of \pxap (localization) and \cl (classification). However, it may still struggle to maintain the same level of performance as on the source dataset, particularly in the presence of extreme shifts (e.g., from \glas to \camsixteen).
To mitigate these limitations, several domain adaptation strategies could be explored. Many existing approaches rely on feature alignment techniques, such as contrastive learning or distribution alignment, to reduce domain shift. Our pixel classifier can play the same role as the image classifier in these methods, allowing such techniques to be adapted to the pixel level. Additionally, in the domain adaptation context, most techniques use clustering to refine pseudo-labels. Since our model produces more discriminative features (as shown in Tables~\ref{tab:accuracy-class-separability-bloc-source} and \ref{tab:accuracy-class-separability-bloc-target}), \pixelcam can improve clustering effectiveness, leading to more reliable pixel pseudo-label adaptation approaches.


% ===============================================================================================
%                                         ABLATIONS
% ===============================================================================================
\clearpage
\section{Ablations}
\label{sec:ablations}

We provide in this section several ablations of our method \pixelcam. 


% ===============================================================================================
%                                         IMPACT OF PIXELCAM DEPTH
% ===============================================================================================

\subsection{Impact of \pixelcam Depth}
\label{subsec:ablation-depth}

All the reported results of our method are obtained with a linear classifier that classifies pixel-embeddings into FG/BG classes. In this section, we further explore the impact of using a multi-layer classifier composed of three ${1\times 1}$ convolutional layers. These act as fully connected layers applied per pixel, each reducing the dimensionality of the previous layer's output by a factor of 2.


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
\centering
\resizebox{0.99\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
&  DeepMIL~\cite{deepmil} ICML &  & 79.9 &\textbf{100.0}& &71.3 &85.0& \\
& DeepMIL w/ linear \pixelcam &  & \underline{85.5} &\textbf{100.0}& &\textbf{75.7} & \underline{88.2}& \\
&  DeepMIL w/ multi-layer \pixelcam &  & \textbf{86.3} & 95.0 & &74.4 & \textbf{88.6}& \\
\hline
&  GradCAM{\textit{++}}~\cite{gradcampp} WACV &  & 77.9 &\textbf{100.0}& &49.1 & 63.4& \\
& GradCAM{\textit{++}} w/ linear \pixelcam &  & \textbf{86.6} &\textbf{100.0}& &\underline{63.4}&\textbf{88.7}& \\
& GradCAM{\textit{++}} w/ multi-layer \pixelcam &  & 86.5 &95.0& &\textbf{64.0}& \underline{85.8}& \\
\hline
\end{tabular}
}
\caption{Localization (\pxap) and classification (\cl) accuracy on the \glas and \camsixteen test sets with linear and multi-layer \pixelcam for the standard WSOL setup.}
\label{tab:accuracy-ml-bloc-source}
\vspace{1em}
\end{table}



As shown in Tab.~\ref{tab:accuracy-ml-bloc-source}, adding hidden layers to the pixel classifier is not necessarily an advantage for the localization task. Indeed, the multi-layer variant performs worse than the linear one on \glas for GradCAM++~\cite{gradcampp} (86.5 vs. 86.6 \pxap) and on \camsixteen for DeepMIL~\cite{deepmil} (74.4 vs. 75.7 \pxap), and it lowers classification accuracy on \glas from 100.0 to 95.0 for both baselines. In the case of OOD data, Tab.~\ref{tab:accuracy-ml-bloc-target} shows that a large performance degradation can be observed when using a multi-layer classifier, e.g., from 30.2 to 25.0 \pxap for DeepMIL on \mbox{\glas $\rightarrow$ \camsixteen}. Therefore, as a general rule, we recommend simply using a linear pixel classifier.

\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
\vspace{2em}
\centering
\resizebox{0.99\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\camsixteen $\rightarrow$ \glas}  & & \multicolumn{2}{c}{\glas  $\rightarrow$ \camsixteen} &\\
& \textbf{WSOL models} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
&  DeepMIL~\cite{deepmil} ICML &  &64.5 &\underline{81.2}& &29.0 &\textbf{55.2}& \\
&   DeepMIL w/ linear \pixelcam &  &\textbf{69.1} &\textbf{83.8}& &\textbf{30.2} &\underline{52.5}& \\
&   DeepMIL w/ multi-layer \pixelcam &  &\underline{67.4} &80.0& &25.0 &51.6& \\
\hline
&  GradCAM{\textit{++}}~\cite{gradcampp} WACV &  & 52.9 &53.7& &\textbf{39.1} &\underline{52.4}& \\
&   GradCAM{\textit{++}} w/ linear \pixelcam &  & \textbf{56.2} &\textbf{71.2}& &\underline{36.9}& \textbf{63.3}& \\
&   GradCAM{\textit{++}} w/ multi-layer \pixelcam &  & \underline{55.8} &\textbf{71.2}& &33.8& 50.6& \\
\hline
\end{tabular}
}
\caption{Localization (\pxap) and classification (\cl) accuracy on the \glas and \camsixteen target test sets with linear and multi-layer \pixelcam for the OOD setup.}
\label{tab:accuracy-ml-bloc-target}
\end{table}






% ===============================================================================================
%                      IMPACT OF MODEL SELECTION OF CAM PRETRAINED MODEL
% ===============================================================================================

\subsection{Impact of Model Selection of CAM Pre-trained Model} 
\label{subsec:ablation_cams}

Our method leverages pseudo-labels extracted from a pretrained WSOL CAM-based model. However, due to the asynchronous convergence of the classification and localization tasks~\cite{rony23}, the criterion used for model selection is expected to have an impact on CAM quality and, therefore, on pseudo-label accuracy. Typically, models are selected based either on their classification performance on the validation set, by taking the best classifier (\becl), or on their localization performance, by taking the best localizer (\beloc). In Tab.~\ref{tab:accuracy-bclcam-source}, we present the impact of this choice on our method \pixelcam.


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
\centering
\resizebox{0.99\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
&  LayerCAM~\cite{layercam} IEEE TIP &  &75.1 &\textbf{100.0}& &33.2 &84.8& \\
&  LayerCAM w/ \pixelcam  (\beloc CAM) &  &\textbf{83.6}&\textbf{100.0}& &\underline{66.2}& \underline{89.1}& \\
&   LayerCAM w/ \pixelcam (\becl CAM) &  & \underline{80.6} &98.8& &\textbf{67.5}& \textbf{89.4}& \\
\hline
& SAT~\cite{SAT} ICCV &  &65.9 &98.8& &32.8 &83.2& \\
&  SAT w/ \pixelcam (\beloc CAM) &  & \textbf{79.1}&\textbf{100.0}& &\textbf{51.2}& \textbf{87.2}& \\
&  SAT w/ \pixelcam (\becl CAM) &  & \underline{75.5} &\textbf{100.0}& &\underline{35.6}& \underline{86.4}& \\
\hline
\end{tabular}
}
\caption{Impact of the model selection criterion of the CAM-based model used to build pseudo-labels: performance of \pixelcam on the \glas and \camsixteen test sets when using CAMs from the \beloc and \becl models, respectively, in the standard WSOL setup. The corresponding WSOL baselines are reported for comparison.}
\label{tab:accuracy-bclcam-source}
\end{table}



Initially, the CAMs used for pixel selection were generated from the \beloc model. We compare this choice to using CAMs from the \becl model. Table~\ref{tab:accuracy-bclcam-source} shows the results with different WSOL baseline methods. We notice that \pixelcam improves localization performance over the baseline regardless of whether \beloc or \becl is used. However, for techniques that use gradients, such as GradCAM++~\cite{gradcampp} and LayerCAM~\cite{layercam}, CAMs generated with \becl can provide better performance, as for LayerCAM on \camsixteen (67.5 vs. 66.2 \pxap). This can be explained by the nature of the CAMs produced by \beloc for these methods. Specifically, \beloc generates CAMs with better localization performance due to higher true positive rates but also introduces more false positives. This negatively impacts training, as incorrect labels are incorporated, potentially reducing overall performance.


We also observe a significant difference for the SAT method~\cite{SAT}, where \beloc CAMs yield 51.2 \pxap on \camsixteen against 35.6 with \becl CAMs. This is mainly explained by the fact that the CAMs generated by the \becl model contain a high number of false positives, similar to those generated by \beloc. However, the \beloc model achieves a higher true positive rate than \becl, which is crucial for the strong performance of \pixelcam.



% ===============================================================================================
%                         IMPACT OF PIXEL SAMPLING TECHNIQUE ON PIXELCAM
% ===============================================================================================

\subsection{Impact of Pixel Sampling Technique}
\label{subsec:ablation_sampling}

Our method \pixelcam uses pseudo-annotations to train the pixel classifier. In particular, we consider random sampling of pixel locations to generate pseudo-labels, as it has been shown to yield better performance than fitting static regions~\citep{negev}. This avoids overfitting to fixed regions and promotes the exploration of potential ROIs.
In this section, we investigate the impact of various sampling techniques to identify pixels associated with FG and BG. We consider two different sampling approaches, namely \emph{Threshold-based (TH)}~\cite{fcam} and \emph{Probability-based (PB)}~\cite{negev}. The \emph{Threshold-based} approach automatically thresholds the CAM. Then, it considers all the pixels with activation above the threshold as foreground and valid for FG sampling. This region is delineated by a mask. FG pixels are sampled uniformly within that mask, while BG pixels are sampled from outside the mask.
The \emph{Probability-based} approach samples FG pixels proportionally to the probability values obtained from the CAM using a multinomial distribution. In contrast, the BG pixels are sampled from the (1 - CAM) activations, so regions with low activations have a higher chance of being sampled.
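
The two sampling schemes can be summarized as follows. This is only a sketch under notation introduced here for illustration, not taken from the main text: $C(p) \in [0, 1]$ denotes the normalized CAM activation at pixel location $p$, $\Omega$ the set of pixel locations, $\tau$ the automatically selected threshold, and $\mathbf{1}[\cdot]$ the indicator function. We further assume that TH samples BG pixels uniformly outside the mask.
\[
\mathrm{TH:}\quad \Pr(p \mid \mathrm{FG}) = \frac{\mathbf{1}[C(p) > \tau]}{\sum_{q \in \Omega} \mathbf{1}[C(q) > \tau]}, \qquad \Pr(p \mid \mathrm{BG}) = \frac{\mathbf{1}[C(p) \leq \tau]}{\sum_{q \in \Omega} \mathbf{1}[C(q) \leq \tau]},
\]
\[
\mathrm{PB:}\quad \Pr(p \mid \mathrm{FG}) = \frac{C(p)}{\sum_{q \in \Omega} C(q)}, \qquad \Pr(p \mid \mathrm{BG}) = \frac{1 - C(p)}{\sum_{q \in \Omega} \bigl(1 - C(q)\bigr)}.
\]
Under TH, every pixel above $\tau$ is equally likely to be drawn as FG regardless of its activation. Under PB, a pixel with activation $0.9$ is nine times more likely to be drawn as FG than a pixel with activation $0.1$, and no hard region boundary is imposed.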

\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
\centering
\resizebox{0.99\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
&  GradCAM{\textit{++}}~\cite{gradcampp} WACV &  & 76.8 &\textbf{100.0}& &49.1& 63.4& \\
& GradCAM{\textit{++}} w/ TH \pixelcam &  & 85.4 & \textbf{100.0} & &\underline{54.2}& \textbf{89.9}& \\
& GradCAM{\textit{++}} w/ PB \pixelcam &  & \textbf{86.6} &\textbf{100.0}& &\textbf{64.1}& \underline{85.1}& \\
\hline
&  SAT~\cite{SAT} ICCV &  & 65.9 &\underline{98.8}& &32.8 &\underline{83.2}& \\
& SAT w/ TH \pixelcam &  & \underline{71.5} &92.5& &\underline{50.9} & 81.0& \\
&  SAT w/ PB \pixelcam &  & \textbf{79.1} &\textbf{100.0} & &\textbf{51.2} & \textbf{87.2}& \\
\hline
\end{tabular}
}
\caption{Impact of pixel sampling technique on \pixelcam performance: Threshold-based (TH) vs. probability-based (PB), over standard WSOL setup.}
\label{tab:sampling_exps}
\end{table}



In Tab.~\ref{tab:sampling_exps}, we observe that the pixel sampling technique has a significant impact on the localization performance of \pixelcam. On the \glas dataset, the choice has a small impact when using GradCAM++ CAMs (86.6 vs. 85.4 \pxap), but for CAMs generated from a transformer architecture such as SAT the impact is significant, with a difference of 7.6 points (79.1 vs. 71.5). Similarly, on a more challenging dataset such as \camsixteen, using a probabilistic approach for sampling pixels can lead to an improvement of 9.9 points (64.1 vs. 54.2 for GradCAM++). This can be explained by the fact that sampling without relying on a threshold favors relevant pixels, which are the most likely to have the correct pseudo-label. On the other hand, the threshold-based method fixes a region for sampling, with a high likelihood of covering wrong regions. Learning with incorrect labels leads to poor models and, therefore, poor performance.

% ===============================================================================================
%                     IMPACT OF THE NUMBER OF SAMPLED PIXELS
% ===============================================================================================

\subsection{Impact of the Number of Sampled Pixels  \texorpdfstring{$n$}{n}}
\label{subsec:ablation_nb_pixel}

We analyze the impact of the number of pixels selected as pseudo-labels to train \pixelcam. To avoid unbalanced pixel classification, we sample the same number $n$ of pixels for FG and BG. We consider the following cases: ${n \in \{1, 5, 10, 20\}}$.
As observed in Tab.~\ref{tab:accuracy-pixels}, varying $n$ can affect both localization and classification to different degrees, depending on the dataset. In terms of localization, both datasets are only slightly affected. However, \camsixteen is largely affected in terms of classification, as performance varies between ${85.1\%}$ and ${90.1\%}$.


\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
\centering 
\resizebox{0.99\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
& \textbf{WSOL models} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
& GradCAM{\textit{++}}~\cite{gradcampp} WACV &  & 76.8 &\textbf{100.0}& &49.1& 63.4& \\
& GradCAM{\textit{++}} w/ \pixelcam: &  &   & & & & & \\
& $n=1$  &  & \textbf{86.7} &\textbf{100.0}& &63.5&\textbf{90.1}& \\
& $n=5$  &  & \underline{86.6} &\textbf{100.0}& &\underline{64.1}&85.1& \\
& $n=10$ &  & 86.0 &\textbf{100.0}& &63.2&\underline{88.8}& \\
& $n=20$ &  & 86.0 &\textbf{100.0}& &\textbf{64.7}&88.7& \\
\hline
\end{tabular}
}
\caption{Impact of the number $n$ of pixels sampled as pseudo-labels on \pixelcam performance. 
We measure the \pxap and \cl performance over the test set for \glas and \camsixteen for standard WSOL setup.}
\label{tab:accuracy-pixels}
\end{table}



% ===============================================================================================
%                               IMPACT OF LAMBDA
% ===============================================================================================

\subsection{Impact of \texorpdfstring{$\lambda$}{lambda}}
\label{subsec:ablation_lambda}

Table~\ref{tab:accuracy-lambda-bloc-source} shows the impact of $\lambda$ on the performance of our method \pixelcam. On \glas, it achieves better results with higher values of $\lambda$: as $\lambda$ decreases from 1 to 0.001, \pxap declines from 83.6 to 76.1. On the \camsixteen dataset, we note that using a low $\lambda$ value (0.001) significantly impacts localization performance (61.6 \pxap, against 66.2 to 67.4 for larger values). Therefore, we recommend using $\lambda$ values in the range of 0.1 to 1.0 for better performance.
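
For reference, the role of $\lambda$ can be sketched as a weight balancing two training terms. The form below is an assumption made for illustration, with symbols introduced here rather than taken from the main text: $\mathcal{L}_{\mathrm{cls}}$ denotes the image-level classification loss and $\mathcal{L}_{\mathrm{pix}}$ the pixel-level FG/BG classification loss computed over the sampled pseudo-labeled pixels.
\[
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \, \mathcal{L}_{\mathrm{pix}}
\]
Under this reading, a very small $\lambda$ leaves the pixel classifier weakly supervised, which would be consistent with the drop in \pxap observed for $\lambda = 0.001$ in Table~\ref{tab:accuracy-lambda-bloc-source}.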

\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[!ht]
% \vspace{-4pt}
\centering 
\resizebox{0.5\textwidth}{!}{
\small
\begin{tabular}{ | l  l | c c c | c  c c c|}
\hline
& &  & \multicolumn{2}{c}{\glas}  & & \multicolumn{2}{c}{\camsixteen} &\\
%\cline{4-5}\cline{7-8} \\
 & \textbf{$\lambda$} &  & \pxap \(\uparrow\)& \cl \(\uparrow\)&  & \pxap \(\uparrow\)& \cl \(\uparrow\)& \\
\hline \hline
&  1 &  & \textbf{83.6}&\textbf{100.0}& &66.2&89.1& \\
& 0.5 &  & \underline{83.1}&\textbf{100.0}& &66.6&\textbf{89.7}& \\
& 0.1  &  & 82.2&\textbf{100.0}& &\textbf{67.4}&88.3& \\
& 0.01  &  & 79.3&96.3& &\underline{66.7}&89.1& \\
& 0.001  &  & 76.1&98.8& &61.6&\underline{89.2}& \\
\hline
\end{tabular}
}
\caption{Impact of hyper-parameter $\lambda$ over \pixelcam in terms of \pxap and \cl performance over test set. \pixelcam uses LayerCAM~\cite{layercam} for pseudo-labels.
}
\label{tab:accuracy-lambda-bloc-source}
% \vspace{-1em}
\end{table}

\subsection{Impact of Using a Pretrained WSOL Model on the Same and External Dataset for Pseudo-labeling}

As mentioned, \pixelcam uses a WSOL CAM-based model to obtain pseudo-labels for foreground and background pixels. In our paper, we train a standard WSOL CAM-based model (DeepMIL, GradCAM++, LayerCAM, or SAT) on the same dataset, so that the model used for pseudo-labeling is robust and the number of false positives is limited. A WSOL CAM-based model trained on an external dataset could be considered instead, since it does not require a preliminary training step and therefore reduces computation time. However, such a model tends to perform poorly on an unseen dataset, which causes a high number of false positives and false negatives and leads to poor pseudo-labels. To support this claim, we trained LayerCAM on \glas to obtain pseudo-labels on \camsixteen, and conversely trained it on \camsixteen to obtain pseudo-labels on \glas.

\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{table}[htbp]
\vspace{-4pt}
\centering
\resizebox{0.85\textwidth}{!}{%
%{\color{red}
\begin{tabular}{|l|cc|cc|cc|cc|}
\hline
& \multicolumn{4}{c|}{\glas} & \multicolumn{4}{c|}{\camsixteen} \\
& \multicolumn{2}{c|}{Same} & \multicolumn{2}{c|}{External} & \multicolumn{2}{c|}{Same} & \multicolumn{2}{c|}{External} \\
& \pxap \(\uparrow\) & \cl \(\uparrow\) & \pxap \(\uparrow\) & \cl \(\uparrow\) & \pxap \(\uparrow\) & \cl \(\uparrow\) & \pxap \(\uparrow\) & \cl \(\uparrow\) \\
\hline \hline
LayerCAM & 75.1 & \textbf{100.0} & 58.1 & 83.8 & 33.2 &84.8& 22.7 & 50.4\\
LayerCAM w/ PixelCAM & \textbf{83.6} & \textbf{100.0} & \textbf{68.1} & \textbf{92.5} & \textbf{66.2} & \textbf{89.1} & \textbf{59.2} & \textbf{81.7} \\
\hline
\end{tabular}
}
%}
\caption{Localization (\pxap) and classification (\cl) accuracy on the \glas and \camsixteen test sets when the LayerCAM model used for pseudo-labeling is trained on the same dataset (Same) or on the other dataset (External).}
\label{tab:ablation-external-ds}
\vspace{-1em}
\end{table}


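As observed in Tab.~\ref{tab:ablation-external-ds}, using a WSOL CAM-based model trained on an external dataset avoids the computational cost of the preliminary training step, but it significantly degrades the training of \pixelcam: \pxap drops from 83.6 to 68.1 on \glas and from 66.2 to 59.2 on \camsixteen. On the \glas dataset, \pixelcam with external pseudo-labels (68.1) even falls below the standard LayerCAM baseline trained on the same dataset (75.1), which makes this option less suitable.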







\clearpage
%\vspace{5}

% ===============================================================================================
%                                         MORE VISUALIZATION
% ===============================================================================================

\section{More Visualization}
\label{sec:appendix_visualization}

\subsection{Visualization Results on Standard WSOL setup}
In this section, we present visual results of \pixelcam compared to WSOL baseline methods. In a standard WSOL setup, we observe that \pixelcam enhances the ROIs detected by WSOL baseline methods on the \glas dataset (Figs.~\ref{fig:example-normal-glas-wsol} and \ref{fig:example-cancer-glas-wsol}). Regarding \camsixteen, a challenging dataset, \pixelcam improves localization for cancer images by extending ROIs, and it reduces incorrect predictions from WSOL baseline methods (Figs.~\ref{fig:example-normal-cam16-wsol} and \ref{fig:example-cancer-cam16-wsol}).
\begin{figure}[!ht]
\centering
  \includegraphics[width=0.85\linewidth]{PixelCAM_GLAS_source_normal-compressed.png}
  \caption{Standard WSOL setup. First column: \textbf{Normal} images from \glas. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }
  \label{fig:example-normal-glas-wsol}
\end{figure}

\begin{figure}[!ht]
\centering
  \includegraphics[width=0.92\linewidth]{PixelCAM_GLAS_source_cancer-compressed.png}
  \caption{Standard WSOL setup. First column: \textbf{Cancerous} images from \glas. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }
  \label{fig:example-cancer-glas-wsol}
\end{figure}

\begin{figure}[!ht]
\centering
  \includegraphics[width=0.999\linewidth]{PixelCAM_CAMELYON_source_normal-compressed.png}
  \caption{Standard WSOL setup. First column: \textbf{Normal} images from \camsixteen. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }
  \label{fig:example-normal-cam16-wsol}
\end{figure}



\begin{figure}[!ht]
  \centering
  \includegraphics[width=0.999\linewidth]{PixelCAM_CAMELYON_source_cancer-compressed.png}
  \caption{Standard WSOL setup. First column: \textbf{Cancerous} images from \camsixteen. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }
  \label{fig:example-cancer-cam16-wsol}
\end{figure}


\clearpage
\subsection{Visualization Results on Target set with OOD setup}
We provide visual results on the target set in the OOD case for both scenarios: \mbox{\camsixteen $\rightarrow$ \glas} and \mbox{\glas $\rightarrow$ \camsixteen}. As we observe, \pixelcam correctly predicts cancer ROIs on average, as illustrated in Figs.~\ref{fig:example-cancer-glas-ood} and \ref{fig:example-cancer-cam16-ood}, but it struggles with normal images, as do the WSOL baseline methods (Figs.~\ref{fig:example-normal-glas-ood} and \ref{fig:example-normal-cam16-ood}).

\begin{figure}[!ht]
\centering
\includegraphics[width=0.9\linewidth]{PixelCAM_GLAS_target_normal-compressed.png}
  \caption{OOD setup:  \mbox{\camsixteen $\rightarrow$ \glas}. First column: \textbf{Normal} images from \glas. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }
  \label{fig:example-normal-glas-ood}
\end{figure}

\begin{figure}[!ht]
\centering
  \includegraphics[width=0.9\linewidth]{PixelCAM_GLAS_target_cancer-compressed}
  \caption{OOD setup:  \mbox{\camsixteen $\rightarrow$ \glas}. First column: \textbf{Cancerous} images from \glas. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }  
  \label{fig:example-cancer-glas-ood}
\end{figure}

\begin{figure}[!ht]
\centering
  \includegraphics[width=0.9\linewidth]{PixelCAM_CAMELYON_target_normal-compressed}
  \caption{OOD setup:  \mbox{\glas $\rightarrow$ \camsixteen}. First column: \textbf{Normal} images from \camsixteen. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }  
  \label{fig:example-normal-cam16-ood}
\end{figure}




\begin{figure}[!ht]
\centering
  \includegraphics[width=0.99\linewidth]{PixelCAM_CAMELYON_target_cancer-compressed.png}
  \caption{OOD setup:  \mbox{\glas $\rightarrow$ \camsixteen}. First column: \textbf{Cancerous} images from \camsixteen. Second column: Ground truth. Next columns: We display the visual CAM results for WSOL baseline without and with \pixelcam, respectively.
  }  
  \label{fig:example-cancer-cam16-ood}
\end{figure}



\clearpage


\begin{figure}[!ht]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}}
  % \begin{tabular}{ccccc}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
    % Input & DeepMil & GradCAM++ & LayerCAM & SAT \\
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-testA_31.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-SAT.png} 
    \\
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_31-PixelCAM-SAT_logits_plot.png} \\    
     \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-testA_20.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testA_20-PixelCAM-SAT_logits_plot.png} \\
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-testB_5.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-source-testB_5-PixelCAM-SAT_logits_plot.png}
  \end{tabular}
  \caption{
  Standard WSOL setup: For the three \textbf{normal} images from \glas, we display the t-SNE projection of foreground and background pixel-features of WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row presents the logits of the pixel classifier of \pixelcam.
  }
  \label{fig:feature-tsne-source-normal-glas}
\end{figure}


\begin{figure}[!ht]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{}}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
   \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-testA_26.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_26-PixelCAM-SAT_logits_plot.png} \\
    
    
     \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-testA_38.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_38-PixelCAM-SAT_logits_plot.png} \\
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-testA_3.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-source-testA_3-PixelCAM-SAT_logits_plot.png} 
  \end{tabular}
  \caption{
  Standard WSOL setup: For the three \textbf{cancerous} images from \glas, we display the t-SNE projection of foreground and background pixel-features of WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row presents the logits of the pixel classifier of \pixelcam.
  }
  \label{fig:feature-tsne-source-cancer-glas}
\end{figure}

\begin{figure}[!ht]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{}}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_016.tif_reg_11_row_25_p_0_x_40101_y_77036_t_68_m_61_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\
    
    
     \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_069.tif_reg_7_row_5_p_0_x_14276_y_39575_t_65_m_75_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\

    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-SAT.png} \\
    &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-src-p_f_ts-i-ts_027.tif_reg_0_row_4_p_11_x_62144_y_125527_t_85_m_64_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\
  \end{tabular}
  \caption{Standard WSOL setup: For each of the three \textbf{cancerous} images from \camsixteen, we display the t-SNE projection of the foreground and background pixel features of the WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row shows the logits of the \pixelcam pixel classifier.
  }
  \label{fig:feature-tsne-source-cancer-cam16}
\end{figure}

\begin{figure}[!ht]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{}}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
   \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-testA_38.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_38-PixelCAM-SAT_logits_plot.png} \\
    
    
     \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-testB_14.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testB_14-PixelCAM-SAT_logits_plot.png} \\

    

    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-testA_26.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-SAT.png} \\
    
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-cancer-target-testA_26-PixelCAM-SAT_logits_plot.png} \\
    

   
  \end{tabular}
  \caption{
  OOD setup (\mbox{\camsixteen $\rightarrow$ \glas}): For each of the three \textbf{cancerous} images from \glas, we display the t-SNE projection of the foreground and background pixel features of the WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row shows the logits of the \pixelcam pixel classifier.
  }
  \label{fig:feature-tsne-target-cancer-glas}
\end{figure}

\begin{figure}[!ht]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{}}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
   \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-testA_35.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_35-PixelCAM-SAT_logits_plot.png} \\
    
    
     \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-testA_49.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_49-PixelCAM-SAT_logits_plot.png} \\

    

    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-testA_2.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-SAT.png} \\
    
    &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{visualization-GLAS-normal-target-testA_2-PixelCAM-SAT_logits_plot.png} \\
       
  \end{tabular}
  \caption{
  OOD setup (\mbox{\camsixteen $\rightarrow$ \glas}): For each of the three \textbf{normal} images from \glas, we display the t-SNE projection of the foreground and background pixel features of the WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row shows the logits of the \pixelcam pixel classifier.
  }
  \label{fig:feature-tsne-target-normal-glas}
\end{figure}

\clearpage

\begin{figure}[!htp]
\centering
  \begin{tabular}{@{}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{\hskip 10pt}c@{}}
    \makecell{Input} & \makecell{DeepMIL} & \makecell{GradCAM++} & \makecell{LayerCAM} & \makecell{SAT} \\
   \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_090.tif_reg_0_row_10_p_2_x_32487_y_61454_t_100_m_75_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\
    
    
     \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-SAT.png} \\
     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_073.tif_reg_5_row_8_p_17_x_16488_y_75686_t_51_m_69_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\

    

    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-DeepMIL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-GradCAMpp.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-LayerCAM.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-SAT.png} \\
    
    &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-DL.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-GC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-LC.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-SAT.png} \\

     &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-DL_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-GC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-LC_logits_plot.png} &
    \includegraphics[width=.12\textwidth]{vis-cam16-cn-trg-p_f_ts-i-ts_021.tif_reg_0_row_26_p_1_x_53503_y_99352_t_59_m_56_w_512_h_512-PixelCAM-SAT_logits_plot.png} \\
       
  \end{tabular}
  \caption{
  OOD setup (\mbox{\glas $\rightarrow$ \camsixteen}): For each of the three \textbf{cancerous} images from \camsixteen, we display the t-SNE projection of the foreground and background pixel features of the WSOL baseline methods without (1st row) and with (2nd row) \pixelcam. The 3rd row shows the logits of the \pixelcam pixel classifier.
  }
  \label{fig:feature-tsne-target-cancer-cam16}
\end{figure}
