
\paragraph{MIL-based frameworks for digital pathology.} 
In recent years, many MIL frameworks have been introduced and extensively evaluated in different settings~\cite{miltransfer_mahmood_icml}. They can be grouped into categories according to the aggregation mechanism they use. Among the most popular attention-based methods are ABMIL~\cite{ilse_attention-based_2018}, CLAM~\cite{lu_data-efficient_2021}, and DSMIL~\cite{li_dual-stream_2021}. TransMIL~\cite{shao_transmil_2021} was among the first to introduce a transformer network specifically for WSIs, in order to model both morphological and spatial correlations. Building on this, DGRMIL~\cite{zhu_dgr-mil_2025} uses a set of learnable ``global vectors'' to summarize distinct morphological patterns and computes cross-attention between the instances and these global vectors, thereby scaling linearly with the number of instances.
Finally, probabilistic MIL methods such as BayesMIL~\cite{cui_bayes-mil_2023} argue that standard attention scores are unreliable proxies for interpretability and address this by introducing a probabilistic instance-wise attention module that yields patch-level uncertainty estimates. Similarly, SGPMIL~\cite{lolos_sgpmil_2025} targets the lack of uncertainty estimation in deterministic models by learning a posterior distribution over attention scores, using an input-independent inducing set of prototypes.


\paragraph{Attention-based Neural PDE solvers.}
Solving Partial Differential Equations (PDEs) is fundamental to modeling complex phenomena in science and engineering.
While traditional numerical approaches such as the Finite Element Method (FEM) offer high accuracy, they typically require discretizing the domain into high-resolution meshes, often containing millions of mesh points, which results in prohibitive computational costs~\cite{grossmann_can_2024}. Consequently, deep learning-based neural operators have emerged as efficient surrogates, capable of learning the mapping between model state and solution fields directly from data~\cite{li_fourier_2020, lu_learning_2021, wu_transolver_2024}.
Transformer architectures are increasingly used in neural PDE solvers because of their ability to model global dependencies~\cite{li_transformer_2022}. However, they often face computational bottlenecks due to the quadratic complexity of standard self-attention~\cite{katharopoulos_transformers_2020, luo_transolver_2025}. Furthermore, simply applying attention to individual mesh points may fail to capture the intricate high-order physical correlations governing the system, as the model can become overwhelmed by low-level geometric details, which prevents effective relation learning~\cite{wu_flowformer_2022}.
We observe that the challenges inherent to long-sequence processing, such as computational complexity and efficient correlation learning, are common to both large-scale physical simulations and WSI analysis. Surprisingly, to the best of our knowledge, neural PDE solvers have not yet been explored in digital pathology.

\paragraph{The Transolver Architecture.}
To address the prohibitive computational cost and the degraded correlation learning caused by large inputs, the Transolver architecture was introduced as a Transformer-based PDE solver for general geometries~\cite{wu_transolver_2024}, and later scaled to larger settings~\cite{luo_transolver_2025}. It introduces Physics-Attention, which decomposes a domain discretized into $N$ mesh points into a set of $M \ll N$ physically consistent clusters (``slices''); these are then aggregated into ``physics-aware tokens'', forming a compact latent representation of distinct physical states. Standard Multi-Head Self-Attention (MSA) is applied to these tokens for correlation modeling with complexity $O(M^2)$, while assigning the $N$ points to the $M$ slices costs $O(NM)$, so the overall cost scales linearly with the number of mesh points. By explicitly modeling ``physical states'' rather than individual points, the model becomes more robust to geometric variations and discretization artifacts, while the learned slices have been shown to correspond to meaningful physical regions, enhancing the model's interpretability and generalization capability~\cite{wu_transolver_2024, luo_transolver_2025}.
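As a sketch of this mechanism, in notation that we assume here for illustration (the symbols $x_i$, $w_{i,j}$, $z_j$, and the learnable projection $W$ are ours and are not taken verbatim from~\cite{wu_transolver_2024}), let $x_i$ denote the $C$-dimensional feature of mesh point $i$. Each point is softly assigned to the $M$ slices through a weight vector $w_i = (w_{i,1}, \dots, w_{i,M})$, and each slice token is the weight-normalized average of the point features:
\[
w_i = \mathrm{softmax}(W x_i), \qquad
z_j = \frac{\sum_{i=1}^{N} w_{i,j}\, x_i}{\sum_{i=1}^{N} w_{i,j}}, \qquad j = 1, \dots, M .
\]
Self-attention is then applied to the $M$ tokens only, and the updated tokens $z'_j$ are broadcast back to the points with the same weights:
\[
(z'_1, \dots, z'_M) = \mathrm{MSA}(z_1, \dots, z_M), \qquad
x'_i = \sum_{j=1}^{M} w_{i,j}\, z'_j .
\]
Under these assumptions, slicing and deslicing cost $O(NMC)$ and token attention costs $O(M^2 C)$, so for a fixed number of slices $M$ the total cost grows linearly in $N$.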
Drawing a parallel to digital pathology, both neural PDE solvers and MIL models face the fundamental challenge of efficiently learning correlations over massive sequences of instances (mesh points in PDEs, patches in WSIs). Viewed through this lens, Transolver's Physics-Attention constitutes a promising approach to facilitate efficient global correlation modeling, by projecting the high-dimensional input space onto a compact set of latent variables. 


\paragraph{Prototype-based Multiple Instance Learning.}
ProtoMIL~\cite{protomil} proposes a self-explainable MIL framework that learns a fixed set of trainable prototype vectors and represents each slide by aggregating, via attention pooling, the maximum L2-similarity between each prototype and its most similar patch embedding, enabling case-based reasoning through explicit prototype--patch matches. However, each prototype is driven by a single maximally activating patch, and the semantic meaning of the learned prototypes is not exhaustively validated.
TPMIL~\cite{yang2023tpmil} refines instance-level features by softly assigning all patch embeddings to trainable prototypes using attention-derived pseudo-labels to better capture intra-class morphological heterogeneity, but this refinement is tightly coupled to a specific attention-based MIL aggregator, limiting architectural flexibility. 
PAMIL~\cite{pamil} similarly employs cross-attention between learnable prototypes and patch embeddings to jointly aggregate instance-level and prototype-level information for slide classification, with prototype refinement again tied to a fixed attention-based aggregator.
Prototype-Based MIL~\cite{sun2025prototype} adopts a two-stage framework in which human-interpretable concepts are first learned via a sparse autoencoder and subsequently aggregated for slide classification using attention-weighted sum pooling and a linear classifier, making overall performance strongly dependent on the quality of the separately learned concept representations.
Taken together, \ours\ is adjacent to these prototype-based and concept-based MIL methods in that it exploits morphological redundancy, but differs fundamentally by learning input-conditioned, softly assigned cluster tokens that are jointly optimized in a single-stage, end-to-end manner and can be seamlessly combined with arbitrary MIL aggregation operators.

% PANTHER~\cite{panther} constructs unsupervised slide representations by fitting a $p$-component mixture model to all patch embeddings and forming slide-level descriptors by concatenating the estimated component statistics.