In this work, we propose \ours, a novel and efficient MIL framework, designed to overcome the limitations of standard attention in WSI analysis. Unlike prior methods that treat patches as isolated units, \ours\ projects patch embeddings into morphology units via soft clustering, and aggregates them into a compact set of context-aware, low-dimensional global tokens, over which self-attention is performed. Global contextual information is then propagated back to the patch embeddings via context broadcasting.
By attending to the tokens rather than patch embeddings, \ours\ achieves linear scaling with respect to the bag size, while maintaining strong representational capacity and high parameter efficiency.

\subsection{Model Architecture}

The \ours\ framework for WSI analysis consists of three sequential stages (Figure~\ref{fig:\ours_framework}): (1) an initial projection of WSI patches into patch embeddings using a pre-trained encoder as a frozen backbone, (2) a stack of \ours\ Blocks that use multi-head self-attention over global token representations to produce context- and morphology-aware patch embeddings, and (3) a final MIL aggregation and classification head that produces the slide-level prediction.

\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{Figures/model_Architecture.pdf}
\caption{\ours\ Framework. The WSI is tessellated into patches, which are encoded into patch embeddings using a frozen backbone. After a linear projection, $T$ consecutive \ours\ Blocks inject global context to yield context-aware patch embeddings. A MIL aggregator and classifier then produce the final slide-level prediction.
}
\label{fig:\ours_framework}
\end{figure}

\subsubsection{Feature Projection}
A batch of WSIs is represented as a batch of bags $\mathbf{X} \in \mathbb{R}^{B \times N \times D_{in}}$, where each bag contains $N$ patch embeddings of dimension $D_{in}$ and $B$ is the batch size. These embeddings are projected into a latent space of dimension $D \ll D_{in}$ via a learnable linear layer followed by Layer Normalization (LN), GELU activation, and Dropout, yielding patch representations $\mathbf{H}^{(0)}$ as input to the first \ours\ Block:

\[
\mathbf{H}^{(0)}
= \text{Dropout}(\text{GELU}(\text{LN}(\text{Linear}(\mathbf{X})))) \in \mathbb{R}^{B \times N \times D}
\]

\subsubsection{The \ours\ Block}
To capture high-order correlations without the quadratic cost of standard self-attention,
the \ours\ Block adopts the Transolver architecture, performing attention over
low-dimensional, context-aware global tokens to achieve linear complexity with respect
to the bag size. As illustrated in Figure~\ref{fig:\ours_block_and_attention}a, it follows a
Transformer encoder-style design with $H$ \ours\ heads and shared projection matrices across heads, augmenting patch
embeddings with global context to produce rich, morphology-aware representations, formulated as:
%\noindent The \ours block follows a pre-norm residual design:
\[
\mathbf{H}' = \mathbf{H}^{(l-1)} + \text{Dropout}(\text{\ours\ }\text{Attention}(\text{LN}(\mathbf{H}^{(l-1)})))
\]
\[
\mathbf{H}^{(l)} = \mathbf{H}' + \text{Dropout}(\text{MLP}(\text{LN}(\mathbf{H}')))
\]
for $l = 1, \dots, T$, where $T$ is the number of consecutive \ours\ Blocks. The MLP comprises two linear layers with GELU activation. \ours\ Attention produces context-aware patch representations by aggregating information across multiple attention heads.

\begin{figure}[t]
  \centering
  \begin{minipage}[t]{0.19\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/model_block.pdf}
    {\small (a) \ours\ Block}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.78\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/model_attention.pdf}
    {\small (b) \ours\ Attention head}
  \end{minipage}
  \caption{(a) The \ours\ Block follows a Transformer encoder architecture with multi-head self-attention. Each \ours\ Block contains $H$ \ours\ Attention heads, whose outputs are concatenated. Skip connections are applied after every Dropout. (b) The \ours\ Attention head projects patch embeddings into $M$ clusters via soft clustering. Each cluster is aggregated into a context-aware token, and attention is applied to the set of $M$ tokens. The refined tokens are projected back to the input latent space via context broadcasting.
  }
  \label{fig:\ours_block_and_attention}
\end{figure}

\subsubsection{\ours\ Attention Head}
\ours\ adopts the Physics-Attention mechanism from Transolver to enable efficient correlation
learning on large-scale inputs. As illustrated in Figure~\ref{fig:\ours_block_and_attention}b, it operates in four stages: (1) soft clustering of patch
representations into morphology-aware clusters, (2) aggregation into morphology-aware tokens,
(3) self-attention over these tokens, and (4) broadcasting the refined tokens back to the
input space to produce context-aware patch representations.

\noindent\textbf{Soft Clustering.}
Intuitively, this step compresses the $N$ patch embeddings into a compact set of $M \ll N$ morphology-aware tokens using soft assignment. Let $\mathbf{H} \in \mathbb{R}^{B \times N \times D}$ denote the input patch representations.
Two learnable linear layers
$\mathbf{W}_x, \mathbf{W}_f \in \mathbb{R}^{D \times (H D_{\text{head}})}$,
with $D_{\text{head}} = D / H$, map $\mathbf{H}$ to
$\mathbf{x},\, \mathbf{f} \in \mathbb{R}^{B \times N \times (H D_{\text{head}})}$,
which are reshaped into $\mathbf{\tilde{x}},\mathbf{\tilde{f}}\in\mathbb{R}^{B\times H\times N\times D_{\text{head}}}$.
All $N$ patch embeddings are then softly assigned to $M$ context-aware clusters
per head. A learnable projection
$\mathbf{W}_{\text{cluster}} \in \mathbb{R}^{D_{\text{head}} \times M}$,
initialized orthogonally, produces:
\[
\mathbf{W} = \text{Softmax}_m\left(\frac{\mathbf{\tilde{x}} \mathbf{W}_{\text{cluster}}}{\tau}\right)
% \quad
% \sum_{m=1}^{M} W_{b,h,n,m} = 1 \ \forall\, b,h,n,
\]
\noindent
The softmax is applied along the cluster dimension, so that the assignment weights of each patch form a valid probability distribution over clusters, i.e., $\sum_{m=1}^{M} W_{b,h,n,m} = 1$ for all $b, h, n$. Here, $\mathbf{W} \in \mathbb{R}^{B \times H \times N \times M}$ denotes the soft assignment matrix. The number of clusters is controlled by $M$, while the learnable positive temperature $\tau \in \mathbb{R}_{+}^{H}$, with one entry per head, regulates the assignment entropy of each attention head. Cluster-specific tokens $\mathbf{S} \in \mathbb{R}^{B \times H \times M \times D_{\text{head}}}$ are then computed as assignment-weighted averages of the projected embeddings $\mathbf{\tilde{f}}$, where $\varepsilon > 0$ is a small constant that guards against division by zero:
\[
\mathbf{S}_{b,h,m,d}
=
\frac{\sum_{n=1}^{N} W_{b,h,n,m}\, \mathbf{\tilde{f}}_{b,h,n,d}}
     {\sum_{n=1}^{N} W_{b,h,n,m} + \varepsilon}%\in \mathbb{R}^{D_{\text{head}}}
% \quad
% \text{with  }\mathbf{S} \in \mathbb{R}^{B \times H \times M \times D_{\text{head}}}
\]
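
\noindent As a brief restatement in matrix form, fix a bag $b$ and a head $h$ and drop these indices, so that $\mathbf{W} \in \mathbb{R}^{N \times M}$ and $\mathbf{\tilde{f}} \in \mathbb{R}^{N \times D_{\text{head}}}$ denote the corresponding slices. The diagonal matrix $\boldsymbol{\Lambda}$ and the all-ones vectors $\mathbf{1}_N, \mathbf{1}_M$ are shorthand introduced here for illustration only. Under this notation, the aggregation is a normalized matrix product:
\[
\mathbf{S} = \boldsymbol{\Lambda}^{-1}\, \mathbf{W}^{\top}\, \mathbf{\tilde{f}},
\qquad
\boldsymbol{\Lambda} = \operatorname{diag}\!\big(\mathbf{W}^{\top}\mathbf{1}_N + \varepsilon\,\mathbf{1}_M\big) \in \mathbb{R}^{M \times M}.
\]
Each token is therefore, up to $\varepsilon$, a convex combination of the $N$ projected patch features, and forming all $M$ tokens requires on the order of $N M D_{\text{head}}$ operations per head.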



\noindent\textbf{Self-Attention.}
To achieve linear scaling with respect to the bag size, \ours\ performs self-attention over the compact set of $M \ll N$ morphology-aware tokens $\mathbf{S}$ obtained via soft clustering and aggregation, rather than over individual patches. Within each head, Query ($\mathbf{Q}$), Key ($\mathbf{K}$), and Value ($\mathbf{V}$) are obtained from the head-wise token embeddings $\mathbf{S}$ via linear projections $\mathbf{W}_{q}, \mathbf{W}_{k}, \mathbf{W}_{v}\in\mathbb{R}^{D_{\text{head}}\times D_{\text{head}}}$ that are shared across heads. The attention output is then given by:
\[
\text{Attn} = \text{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{D_{\text{head}}}}\right),
\quad \mathbf{S}' = \text{Dropout}(\text{Attn} \cdot \mathbf{V})
\]
with $\mathbf{S}'\in \mathbb{R}^{B \times H \times M \times D_{\text{head}}}$. Since $M \ll N$, applying the attention operator over the $M$ context-aware tokens, instead of the $N$ patch embeddings, replaces the $N \times N$ attention map with an $M \times M$ one and allows the model to scale linearly with the bag size. Because each token aggregates information from all patches in the bag, \ours\ can capture correlations beyond local spatial neighborhoods.
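
\noindent For concreteness, and assuming the usual right-multiplication convention with the softmax normalized over the key dimension (both are assumptions made here for illustration), the projections can be written as
\[
\mathbf{Q} = \mathbf{S}\mathbf{W}_q,\quad
\mathbf{K} = \mathbf{S}\mathbf{W}_k,\quad
\mathbf{V} = \mathbf{S}\mathbf{W}_v,
\qquad
\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{B \times H \times M \times D_{\text{head}}},
\]
so that $\text{Attn} \in \mathbb{R}^{B \times H \times M \times M}$ and each of its rows sums to one. Under this reading, the attention map has $M^2$ entries per head regardless of the bag size $N$.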

\noindent\textbf{Context Broadcasting.}
The refined tokens $\mathbf{S}'$ are broadcast back to the input latent space using the same
assignment weights $\mathbf{W}$ from the soft clustering step, reconstructing each patch
representation as a weighted combination of updated tokens:
\[
\mathbf{O}_{b,h,n,d}
=
\sum_{m=1}^{M} \mathbf{S}'_{b,h,m,d}\, \mathbf{W}_{b,h,n,m},
\quad
\mathbf{O} \in \mathbb{R}^{B \times H \times N \times D_{\text{head}}}.
\]
Head-wise representations are concatenated along the feature dimension into a tensor in $\mathbb{R}^{B\times N\times (HD_{\text{head}})}$ and linearly projected to the
model dimension $D$, yielding the context-aware patch representations returned by \ours\ Attention.
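
\noindent To make the effect of a single head explicit, the clustering, attention, and broadcasting steps can be composed into one expression. The following is an illustrative sketch only: it ignores Dropout, fixes a bag $b$ and a head $h$ (indices dropped, so $\mathbf{W} \in \mathbb{R}^{N \times M}$), assumes $\mathbf{V} = \mathbf{S}\mathbf{W}_v$, and uses the shorthand $\boldsymbol{\Lambda} = \operatorname{diag}(\mathbf{W}^{\top}\mathbf{1}_N + \varepsilon\,\mathbf{1}_M)$ for the aggregation normalizer, which is notation introduced for this sketch rather than part of the model definition. With $\mathbf{S} = \boldsymbol{\Lambda}^{-1}\mathbf{W}^{\top}\mathbf{\tilde{f}}$,
\[
\mathbf{O}
= \mathbf{W}\,\text{Attn}\,\mathbf{V}
= \underbrace{\mathbf{W}\,\text{Attn}\,\boldsymbol{\Lambda}^{-1}\mathbf{W}^{\top}}_{\in\, \mathbb{R}^{N \times N},\ \text{rank} \,\le\, M}\;\mathbf{\tilde{f}}\,\mathbf{W}_v .
\]
Under these assumptions, each head behaves like an $N \times N$ patch-to-patch mixing matrix of rank at most $M$. This matrix is never materialized: evaluating the product from right to left involves no intermediate with more than $N \times \max(M, D_{\text{head}})$ entries.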


\subsubsection{Aggregation and Prediction}
After $T$ \ours\ Blocks, the context-aware patch representations
$\mathbf{H}^{(T)} \in \mathbb{R}^{B \times N \times D}$ are aggregated into a slide-level
embedding $\mathbf{z} \in \mathbb{R}^{B \times 1\times D}$ using an MIL aggregator $\mathcal{A}(\cdot)$:
\[
\mathbf{z} = \mathcal{A}\big(\mathbf{H}^{(T)}\big)
\]
In practice, $\mathcal{A}$ may be a simple non-parametric pooling operator, such as mean or max pooling, or a more expressive attention-based or gated-attention
aggregator. These choices do not affect the structure of the \ours\ Blocks and can be interchanged without modification; formal definitions of the aggregation functions are provided in Appendix~\ref{app:aggregation}.
A final linear classifier then maps $\mathbf{z}$ to task-specific class logits.

\subsection{Computational Efficiency}
\ours\ avoids the quadratic cost of standard self-attention by decoupling the bag size $N$ from the attention operator.
Attending to all $N$ patch embeddings would incur $O(N^2 D)$ complexity. \ours\ Attention instead attends to the $M$ context-aware tokens, achieving an overall complexity of $O(MND + M^2D)$. Since $M$ is a fixed hyperparameter with $M\ll N$, the cost grows linearly with the bag size $N$, making the model well suited to the long patch sequences of WSIs.
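
\noindent As a rough accounting of where the terms in $O(MND + M^2D)$ come from (a sketch that counts only the dominant matrix products of one attention layer, summed over the $H$ heads with $H D_{\text{head}} = D$, and ignores constants, Dropout, and normalization):
\[
\underbrace{O(NMD)}_{\text{soft clustering}}
+ \underbrace{O(NMD)}_{\text{token aggregation}}
+ \underbrace{O(M^2 D)}_{\text{token self-attention}}
+ \underbrace{O(NMD)}_{\text{context broadcasting}}
= O(MND + M^2 D).
\]
The patch-wise linear projections, and the MLP if its hidden width is proportional to $D$, contribute a further $O(ND^2)$, which is likewise linear in $N$. As a purely hypothetical numerical illustration, take $N = 10^4$ and $M = 64$: the quadratic factor is $N^2 = 10^8$, whereas $NM + M^2 = 640{,}000 + 4{,}096 = 644{,}096$, a reduction of roughly $155\times$ in the $N$-dependent factor before constants are taken into account.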

