\subsection{Datasets}
\label{sec:datasets}
\paragraph{CAMELYON16}~\cite{ehteshami_bejnordi_diagnostic_2017} consists of 399 WSIs of sentinel lymph node tissue sections derived from women with breast cancer. The dataset is split into a training set of 270 images and a test set of 129 images. Collected from two medical centers in the Netherlands, it includes exhaustive pixel-level annotations of metastatic regions (both macrometastases and micrometastases) verified by expert pathologists. We use TRIDENT~\cite{vaidya_molecular-driven_2025, zhang_accelerating_2025} to segment and patch the WSIs at 10$\times$ magnification~\cite{mammadov_self-supervision_2025} into 224$\times$224 non-overlapping patches and utilize the UNIv1~\cite{chen_towards_2024} encoder for feature extraction. Similarly to Lu et al.~\cite{lu_data-efficient_2021}, we follow a 10-fold cross-validation protocol and report mean bag-level performance.

\paragraph{TCGA-NSCLC} We use the dataset from The Cancer Genome Atlas (TCGA) program for the non-small cell lung carcinoma (NSCLC) subtyping task~\cite{cooper_pancancer_2018, campbell_distinct_2016}. The dataset consists of Hematoxylin and Eosin (H\&E) stained WSIs in two distinct cohorts: Lung Adenocarcinoma (TCGA-LUAD) and Lung Squamous Cell Carcinoma (TCGA-LUSC). Specifically, we use 494 LUAD and 512 LUSC cases for a total of 1,006 slides. We segment and patch the WSIs at 10$\times$ magnification~\cite{mammadov_self-supervision_2025} into 224$\times$224 non-overlapping patches and use the UNIv1~\cite{chen_towards_2024} encoder for feature extraction. Performance is reported over 4 folds.

\paragraph{PANDA}~\cite{panda_dataset} is derived from the MICCAI 2020 Prostate Cancer Grade Assessment challenge and comprises 10,609 WSIs of prostate core needle biopsies, annotated with slide-level Gleason scores and ISUP grades alongside expert tissue annotations. We address ISUP grading (0--5) as a 6-class classification task and follow a 5-fold cross-validation protocol with stratified splits, where each fold uses approximately 80\% of the samples for training, 5\% for validation, and 15\% for testing. We segment the WSIs into non-overlapping patches of size 224$\times$224 pixels at 20$\times$ magnification~\cite{song_morphological_2024} and use the UNIv1~\cite{chen_towards_2024} encoder for feature extraction.

\paragraph{BRACS}~\cite{brancati_bracs_2022} comprises 547 H\&E stained WSIs and over 4,500 annotated regions of interest derived from 189 patients. It is designed to advance the automatic detection of challenging ``atypical'' (precancerous) lesions, which are often underrepresented in other public datasets. The slides are annotated into seven histological subtypes, grouped into three main categories: Benign (Normal, Pathological Benign, Usual Ductal Hyperplasia), Atypical (Flat Epithelial Atypia, Atypical Ductal Hyperplasia), and Malignant (Ductal Carcinoma in Situ, Invasive Carcinoma). We focus on coarse classification into the three main categories (3-class classification), using the train/validation/test split provided with the dataset. We segment the WSIs into non-overlapping patches of size 224$\times$224 pixels at 20$\times$ magnification~\cite{song_morphological_2024} and use the UNIv1~\cite{chen_towards_2024} encoder for feature extraction. Performance is reported over 5 seeds.



\subsection{Implementation Details}
\label{sec:implementation_details}
All models are trained and evaluated in Python with PyTorch, using the same PyTorch Lightning
training pipeline with identical data loading, batching, and hardware configurations.
For \ours, training is performed using the standard cross-entropy loss on slide-level labels,
while competing methods are trained using the loss functions specified in their original works.

The models are trained for a maximum of 30 epochs on a single A100 GPU using full-precision (FP32) arithmetic, except MeanMIL, which is trained for a maximum of 50 epochs to reach convergence. We optimize \ours\ with AdamW, using a base learning rate of $2\times10^{-4}$, weight decay of $1\times10^{-5}$, and momentum parameter $0.9$. We use a cosine annealing learning-rate schedule with a 6-epoch warm-up phase starting at $1\times10^{-5}$ and a minimum learning rate of $1\times10^{-7}$. Early stopping is governed by a patience of 20 epochs and a performance threshold of $10^{-4}$. The learning rate and weight-decay values are logged every epoch to enable fine-grained monitoring of the training process.
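
As an illustration of the schedule described above, the learning rate at epoch $t$ can be sketched as follows. This sketch assumes a linear warm-up and a single cosine decay spanning the remaining epochs up to the maximum $T$; neither the warm-up shape nor the decay horizon is specified above, and the symbols $\eta_{0}$, $\eta_{\mathrm{base}}$, $\eta_{\min}$, $T_{w}$, and $T$ are notation introduced only for this illustration:
\begin{equation}
\eta(t) = \left\{
\begin{array}{ll}
\displaystyle \eta_{0} + \left(\eta_{\mathrm{base}} - \eta_{0}\right)\frac{t}{T_{w}}, & 0 \le t < T_{w}, \\[2.5ex]
\displaystyle \eta_{\min} + \frac{1}{2}\left(\eta_{\mathrm{base}} - \eta_{\min}\right)\left(1 + \cos\frac{\pi\,(t - T_{w})}{T - T_{w}}\right), & T_{w} \le t \le T,
\end{array}
\right.
\end{equation}
with $\eta_{0} = 1\times10^{-5}$, $\eta_{\mathrm{base}} = 2\times10^{-4}$, $\eta_{\min} = 1\times10^{-7}$, $T_{w} = 6$, and $T = 30$. Under these assumptions the two branches agree at $t = T_{w}$, where $\eta(T_{w}) = \eta_{\mathrm{base}}$, and the rate decays to $\eta(T) = \eta_{\min}$ at the final epoch.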

Model-specific hyperparameters and optimizer choices for competitors follow the respective original papers and are selected to ensure stable convergence based on observed loss curves. FLOPs are reported for a single forward pass during evaluation on a dummy input bag of 1,000 patch embeddings and serve as a proxy for algorithmic complexity. Wall-clock training and inference times measure end-to-end execution. We additionally report peak GPU utilization as an implementation-level efficiency metric that reflects how effectively each model translates computation into hardware usage under identical experimental conditions. Complete code and instructions are publicly available at \url{https://github.com/mandlos/CAPRMIL}.

\subsection{Evaluation of Context-Aware Tokens}
\begin{figure}[ht]
    \centering
    \includegraphics[width=1\linewidth]{Figures/test_001_heatmap_alternative.pdf}
    \caption{\textbf{Token--patch assignment heatmaps.} Test slide from CAMELYON16. \textit{Top:} Soft assignment weights from one \ours\ attention head, indicating each patch's contribution to the $M$ context-aware tokens. \textit{Bottom:} Top-$8$ patches per token ranked by assignment score, highlighting the dominant morphological patterns contributing to each token.}
    \label{fig:test_001_heatmap}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=1\linewidth]{Figures/test_021_heatmap_alternative.pdf}
    \caption{\textbf{Token--patch assignment heatmaps.} Test slide from CAMELYON16. \textit{Top:} Soft assignment weights from one \ours\ attention head, indicating each patch's contribution to the $M$ context-aware tokens. \textit{Bottom:} Top-$8$ patches per token ranked by assignment score, highlighting the dominant morphological patterns contributing to each token.}
    \label{fig:test_021_heatmap}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=1\linewidth]{Figures/test_075_heatmap_alternative.pdf}
    \caption{\textbf{Token--patch assignment heatmaps.} Test slide from CAMELYON16. \textit{Top:} Soft assignment weights from one \ours\ attention head, indicating each patch's contribution to the $M$ context-aware tokens. \textit{Bottom:} Top-$8$ patches per token ranked by assignment score, highlighting the dominant morphological patterns contributing to each token.}
    \label{fig:test_075_heatmap}
\end{figure}