% Small paragraph on datasets and one sentence to refer that the implementation is on appendix
\paragraph{Datasets, Tasks and Evaluation Metrics.}
We evaluate our approach on four WSI benchmarks: \textbf{CAMELYON16}~\cite{ehteshami_bejnordi_diagnostic_2017}
for tumor detection, \textbf{TCGA-NSCLC}~\cite{cooper_pancancer_2018,campbell_distinct_2016}
for lung cancer subtyping, \textbf{BRACS}~\cite{brancati_bracs_2022}
for coarse breast lesion classification, and \textbf{PANDA}~\cite{panda_dataset}
for prostate ISUP grading. Dataset-specific evaluation protocols, metrics, and implementation
details are provided in the Appendix. We report slide-level classification performance and
calibration using area under the curve (\textbf{AUC}) and adaptive expected calibration error
(\textbf{ACE})~\cite{nixon_measuring_2019}.
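
For reference, a common formulation of ACE following~\cite{nixon_measuring_2019} can be sketched as below. The symbols $K$, $R$, $\mathrm{acc}$, and $\mathrm{conf}$ are notation introduced only for this sketch, and the number of ranges $R$ is a free parameter that is not fixed here:
\[
\mathrm{ACE} = \frac{1}{KR}\sum_{k=1}^{K}\sum_{r=1}^{R}\bigl|\,\mathrm{acc}(r,k) - \mathrm{conf}(r,k)\,\bigr|,
\]
where $K$ is the number of classes, $R$ is the number of calibration ranges, and $\mathrm{acc}(r,k)$ and $\mathrm{conf}(r,k)$ denote the accuracy and mean predicted confidence for class $k$ within range $r$. The ranges are chosen adaptively so that each contains approximately the same number of predictions; lower values indicate better calibration.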



% Bag-level classification table
\subsection{Slide-level Performance and Parameter Efficiency}

\input{Tables/bag_level_classification_table}

\ours\ achieves competitive AUC and calibration relative to state-of-the-art MIL methods
(Table~\ref{tab:bag-level-performance_uni}), with performance differences consistently
within one standard deviation, while being substantially more parameter-efficient.
Notably, these results are obtained with simple mean aggregation, underlining the
strength of the learned context-aware patch representations. Across datasets, \ours\ matches parameter-efficient methods such as ABMIL, CLAM, and PAMIL on CAMELYON16 and
TCGA-NSCLC, and ranks among the top-performing approaches on multiclass tasks including
PANDA and BRACS. At the same time, \ours\ reduces trainable parameters by approximately
48\% relative to ABMIL, and by up to 88\% and 92.8\% relative to the transformer-based methods
TransMIL and DGRMIL, respectively. In terms of compute, \ours\ reduces inference FLOPs by 52\% to over 99\% compared to ABMIL, TransMIL, and DGRMIL.

In contrast, naive mean aggregation (a linear projection, mean pooling, and a
linear classifier) degrades substantially on large-bag tasks. The Mean
baseline underperforms by 28.2 AUC points on CAMELYON16 and 11.2 AUC points on BRACS, where bags contain 4k--20k instances. On PANDA (average bag size $\sim$500), mean pooling performs comparably to other methods, with a similar trend on TCGA-NSCLC. Overall, these results indicate that context-aware tokenization is critical for maintaining discriminative capacity while enabling a parameter- and computation-efficient formulation.
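
As a minimal sketch of the Mean baseline described above, assume a bag of $N$ patch features $x_1,\dots,x_N$. The symbols $W_p$, $b_p$, $W_c$, $b_c$, and the softmax output are notation and assumptions introduced for this illustration only:
\[
z = \frac{1}{N}\sum_{i=1}^{N}\left(W_p x_i + b_p\right), \qquad \hat{y} = \mathrm{softmax}\left(W_c z + b_c\right).
\]
Every patch contributes to $z$ with the same weight $1/N$. When only a small fraction of instances is informative, their share of $z$ shrinks as $N$ grows, which is consistent with the degradation observed on large bags.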


% 
\begin{figure}[t]
    \centering
    \includegraphics[width=1\linewidth]{Figures/test_016_heatmap_tsmil.pdf}
    \caption{\textbf{Token--patch assignment heatmaps.} Test slide from CAMELYON16. \textit{Top:} Soft assignment weights from one \ours\ attention head, indicating each patch's contribution to the $M$ context-aware tokens. \textit{Bottom:} Top-$8$ patches per token ranked by assignment score, highlighting dominant morphological patterns for each token.}
    \label{fig:test_016_heatmap}
\end{figure}

As shown in Figure~\ref{fig:test_016_heatmap} and Appendix~A, Figures~\ref{fig:test_001_heatmap}--\ref{fig:test_075_heatmap}, tokens tend to aggregate patches with visually coherent histological patterns, such as adipose-rich or epithelial-dominant regions, while de-emphasizing unrelated tissue types. This observation is further supported by a cell-level analysis (Appendix~D, Figures~\ref{fig:cells_test001}--\ref{fig:cells_test021}), which shows that token-assigned regions exhibit distinct cellular composition profiles. The top-$k$ assigned patches per token indicate that a limited subset of instances dominates each token's construction.
Specifically, in Figure~\ref{fig:test_016_heatmap}, Token~1 predominantly captures adipose-rich regions, as indicated by the low cellular content of its top-8 assigned patches. Tokens~2 and~3 focus on tumor-related
tissue: Token~2 aggregates malignant epithelial regions, while Token~3 captures
stromal or tumor-associated connective tissue. Finally, Token~4 primarily represents benign
tissue, with patches exhibiting more homogeneous cellular organization.




\subsection{Computational and Memory Efficiency}

\begin{figure}[h]
    \centering
    \subfigure[
    % \textbf{Model efficiency plot}. GPU memory footprint vs.\ training time.
    ]{
        \includegraphics[width=0.46\linewidth]{Figures/model_efficiency_bubble_plot_pamil.png}
        \label{fig:model_efficiency_scatterplot}
    }
    \hfill
    \subfigure[
    % \textbf{Model efficiency vs performance plot}. GPU memory footprint vs.\ ACC.
    ]{
        \includegraphics[width=0.46\linewidth]{Figures/model_efficiency_performance_bubble_plot_acc_pamil.png}
        \label{fig:model_efficiency_vs_performance_scatterplot}
    }
    \caption{\textbf{Model efficiency analysis}.
    ($a$) GPU memory footprint (peak during training, averaged over 30 epochs) vs.\ training time
    (entire training set, averaged over 30 epochs).
    ($b$) GPU memory footprint vs.\ ACC.
    Marker size denotes the number of trainable parameters.}
    \label{fig:model_efficiency_combined}
\end{figure}

While parameter count and FLOPs provide useful proxies for model efficiency, practical deployment at whole-slide scale also depends on empirical resource utilization. Figure~\ref{fig:model_efficiency_combined} examines this by relating peak GPU memory usage, training time, and slide-level performance across competing MIL methods. As shown in Figure~\ref{fig:model_efficiency_combined}(a), \ours\ exhibits a substantially lower memory footprint and shorter training time than transformer-based approaches, reflecting its linear-scaling design and low-dimensional intermediate representations. Despite this resource efficiency, \ours\ remains competitive with more computationally demanding models because it operates on rich, context-aware patch embeddings.
Figure~\ref{fig:model_efficiency_combined}(b) further illustrates the trade-off between accuracy and memory consumption. \ours\ achieves high balanced accuracy, with performance differences consistently within one standard deviation of leading competitors, while operating under a significantly smaller GPU memory budget. In contrast, transformer-based methods such as TransMIL and DGRMIL incur large memory overheads for only marginal performance gains. Attention-based and probabilistic MIL methods offer stronger aggregation modules, but at the cost of increased computational or memory requirements, suggesting that the context-aware representations of \ours\ provide a lightweight yet competitive alternative.



\subsection{Ablation Studies}


\paragraph{Clusters and heads.}
Varying the number of clusters $M$ while fixing $H=8$ and the MLP ratio to 4 yields stable performance across a wide range of values ($M\in\{2,4,8,16\}$), with no consistent gains from increasing the number of clusters beyond small to moderate values (Appendix Table~\ref{tab:tsmil_ablations_main}, top). Similarly, increasing the number of attention heads $H$ improves performance from 2 to 8 heads but saturates thereafter, with no clear benefit at larger values (Appendix Table~\ref{tab:tsmil_ablations_main}, middle). Based on these trends, we adopt $M=4$ and $H=8$ as balanced choices that provide sufficient contextual capacity without unnecessary complexity.

\paragraph{MLP expansion ratio.}
Ablating the MLP expansion ratio with fixed $M=4$ and $H=8$ indicates that a smaller ratio slightly degrades performance, while a larger ratio yields more consistent results across metrics (Appendix Table~\ref{tab:tsmil_ablations_main}, bottom). We therefore use an MLP ratio of 4 in all experiments. Overall, these ablations indicate that the selected configuration ($M=4$, $H=8$, MLP ratio $=4$) offers a robust trade-off between representational capacity and efficiency and is not sensitive to the precise hyperparameter choices.

\paragraph{Practical benefits vs.\ ABMIL in low-data regimes.}
Beyond matching ABMIL under the full training set, \ours\ offers a clear practical advantage when labeled data are limited, by shifting modeling capacity from the aggregation head to the instance representation. Table~\ref{tab:panda_lowdata_abmil_caprmil} compares ABMIL and \ours\ on the multiclass PANDA benchmark when trained on progressively smaller fractions of the training set. Across most data fractions, \ours\ achieves lower calibration error (ACE) and higher accuracy (ACC) and Cohen's $\kappa$, with the largest gains observed in the 10--50\% training regimes. In particular, \ours\ improves ACC by $3.5\%$, $4.7\%$, and $4.3\%$ at 10\%, 25\%, and 50\% of the training data, respectively. For $\kappa$, the primary evaluation metric for this task, \ours\ outperforms ABMIL by $1.9\%$, $2.0\%$, and $2.7\%$ in the same low-data regimes, performs comparably at 75\%, and again improves upon ABMIL under the full training set. In addition, \ours\ yields lower ACE across all training fractions, indicating more reliable calibration. These results complement our efficiency analysis: while \ours\ introduces a modest overhead during representation construction, it reduces inference FLOPs relative to ABMIL (Table~\ref{tab:bag-level-performance_uni}) and exhibits improved generalization and calibration when data are scarce. Overall, this suggests that learning context-aware patch representations prior to pooling provides a more data-efficient alternative to concentrating model capacity solely in the aggregation stage.
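
For reference, Cohen's $\kappa$ in its unweighted form corrects the observed agreement for agreement expected by chance. The symbols $p_o$ and $p_e$ are notation introduced only for this sketch; weighted variants, which are common for ordinal grading tasks, replace both terms with weighted sums, and the variant used for PANDA follows the evaluation protocol given in the Appendix:
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
where $p_o$ is the observed fraction of slides on which the predicted and reference grades agree, and $p_e$ is the agreement expected by chance from the marginal class frequencies. A value of $\kappa = 1$ corresponds to perfect agreement and $\kappa = 0$ to chance-level agreement.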

\input{Tables/panda_lowdata_abmil_caprmil}

\paragraph{Modularity and aggregation robustness.} Table~\ref{tab:modularity_uni} compares different MIL aggregation strategies trained jointly with \ours\ representations.
In contrast to prior MIL approaches that rely heavily on sophisticated attention pooling, we observe that replacing mean aggregation with attention or gated attention leads to broadly comparable performance across datasets, within one standard deviation. Notably, on the more challenging multiclass tasks PANDA and BRACS, attention-based aggregators yield performance increases ranging from $0.8\%$ to $2.4\%$, suggesting that additional aggregation capacity may be beneficial in more complex settings.
Overall, these results indicate that the \ours\ block already encodes most of the relevant contextual and discriminative information at the patch level, rendering the choice of final aggregation largely non-critical for performance.
Although attention-based aggregators add parameters, they do not provide consistent gains across all tasks, highlighting diminishing returns once strong instance representations are learned.
These findings underline the modularity of \ours\ and demonstrate that competitive performance can be achieved with simple, parameter-efficient aggregation, while retaining the flexibility to incorporate more expressive MIL heads when task complexity demands it.

\input{Tables/modularity_table_uni}
