
Whole Slide Image (WSI) analysis has become the foundation of clinical practice in computational pathology \cite{alkhalaf_integration_2024, wang_advances_2024}; however, the sheer size of WSIs poses a significant challenge for deep learning approaches \cite{brixtel_whole_2022, lu_data-efficient_2021, gadermayr_multiple_2024}. In addition, pixel-level annotations are prohibitively expensive and time-consuming to obtain, so clinical datasets typically provide only slide-level labels rather than fine-grained annotations \cite{lu_data-efficient_2021, song_artificial_2023, gadermayr_multiple_2024}.

%\subsection{MIL for WSI}
To address the computationally prohibitive size of WSIs and the lack of pixel-level annotations, Multiple Instance Learning (MIL) has been established as the standard framework for WSI analysis. The MIL pipeline comprises patch feature extraction, typically adopting pre-trained foundation models \cite{xiong_survey_2025}, followed by aggregation/pooling to produce the slide-level representation for downstream tasks.
In recent years, attention-based mechanisms have emerged as a promising choice of trainable MIL aggregator \cite{ilse_attention-based_2018, wang_advances_2024, gadermayr_multiple_2024}, owing to their strong correlation learning capabilities.
While effective, approaches that apply standard attention directly to the patch embeddings face computational bottlenecks due to the quadratic complexity of the attention operator~\cite{shao_transmil_2021}. Attention-based MIL methods for WSIs have also been found to be highly susceptible to overfitting and to offer limited interpretability \cite{zhang_attention-challenging_2025}, and they often lack principled uncertainty quantification \cite{sun2025prototype, cui_bayes-mil_2023, lolos_sgpmil_2025}, which limits their potential for clinical translation.
Therefore, developing aggregation strategies that can effectively model instance interactions, handle the challenges inherent to long sequence processing in WSIs, and provide reliable representations remains an active area of research \cite{bilal_aggregation_2023, fang_mammil_2024}.

%\subsection{Attention in Neural PDE solvers}
In parallel, we observe that neural Partial Differential Equation (PDE) solvers \cite{li_fourier_2020, hao_gnot_2023, wu_transolver_2024} face a similar challenge: how to achieve efficient and reliable correlation learning on large-scale inputs. Solving PDEs often involves modeling complex phenomena that may exhibit long-range interactions, on domains discretized into millions of mesh points \cite{grossmann_can_2024}.
Attention-based methods have been used in PDE modeling, but they likewise suffer from prohibitive computational cost and degraded correlation learning at large input scales \cite{katharopoulos_transformers_2020, wu_transolver_2024}. We therefore argue that ideas that have successfully tackled these problems in neural surrogate PDE solvers could provide new insights for digital pathology.

%\subsection{Contributions}
In this work, we introduce \ours, a novel and efficient attention-based MIL framework for WSI analysis. It proposes a paradigm shift: the complexity of correlation learning is moved out of the MIL aggregator and into context-aware patch representations.
Following the architecture of Transolver \cite{wu_transolver_2024, luo_transolver_2025}, which shows promising results in efficient PDE modeling, we apply Multi-Head Self-Attention (MSA) over a small set of global context-aware tokens, achieving linear computational complexity with respect to the number of input patches and promoting effective correlation learning on downstream tasks. Our main contributions are summarized as follows:
\begin{enumerate}[noitemsep, topsep=2pt]
    \item \textbf{We propose a novel and efficient MIL setting based on the Transolver architecture.} Tackling the challenge of the large dimensionality of the input, \ours\ introduces a bottleneck before the attention operator, which consists of: (1) soft clustering of the patch embeddings and (2) aggregating each cluster into a context-aware token. By utilizing MSA over the context-aware tokens, \ours\ achieves linear computational complexity with respect to the bag size and produces rich morphology\slash context-aware patch representations.

    \item \textbf{A highly parameter-efficient formulation.} Our approach performs on par with current state-of-the-art (SOTA) MIL heads, while reducing the total number of trainable parameters by 48\% compared to ABMIL and by up to 92.8\% compared to SOTA trans\-former-based MIL methods. This significantly reduces the computational requirements during training and inference in terms of time, FLOPs, and memory utilization.

    \item \textbf{A scalable, aggregator-agnostic formulation that can be adapted to multiple MIL heads.} Our formulation is independent of the MIL aggregator and can be applied to different commonly used MIL settings with small computational overhead.
\end{enumerate}
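
To make the bottleneck described in the first contribution concrete, the following is a minimal sketch that assumes the slicing formulation of Transolver \cite{wu_transolver_2024} carries over unchanged. The symbols $N$, $M$, $d$, $x_i$, $w_{ij}$, $z_j$, and the projection matrix $W$ are illustrative notation introduced here and not necessarily the exact parameterization of \ours. Given $N$ patch embeddings $x_i$ of dimension $d$ and $M \ll N$ clusters, with $W$ an $M \times d$ learnable projection, the soft assignment weights and the context-aware tokens can be written as
\[
w_{ij} = \frac{\exp\big((W x_i)_j\big)}{\sum_{k=1}^{M} \exp\big((W x_i)_k\big)}, \qquad
z_j = \frac{\sum_{i=1}^{N} w_{ij}\, x_i}{\sum_{i=1}^{N} w_{ij}}, \qquad j = 1, \dots, M.
\]
MSA is then applied to the $M$ tokens only, giving updated tokens $z'_j$, which are broadcast back to the patches as
\[
x'_i = \sum_{j=1}^{M} w_{ij}\, z'_j .
\]
Under these assumptions, clustering and broadcasting cost $\mathcal{O}(NMd)$ and attention over the tokens costs $\mathcal{O}(M^2 d)$, so for a fixed $M$ the total cost grows linearly in the bag size $N$, in contrast to the $\mathcal{O}(N^2 d)$ cost of attention applied directly to the patches.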

We evaluate \ours\ on several publicly available computational pathology datasets. Paired with a simple MeanMIL aggregator, our method matches SOTA performance while achieving leading efficiency, underscoring the adaptability of the framework.

