\documentclass{midl} % Include author names
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\jmlrvolume{-- 238}
\editors{Accepted for publication at MIDL 2026}

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{booktabs} % for \toprule, \midrule, \bottomrule


\title[MoA: Mixture of Aggregators]{MoA: Mixture of Aggregators Improves Slide-Level Diagnosis in Computational Pathology}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Fatih Ozlugedik\midljointauthortext{Contributed equally}\nametag{$^{1}$}} \Email{fatih.oezluegedik@helmholtz-munich.de}\\
\addr $^{1}$ Institute of AI for Health, Helmholtz Munich, Munich, Germany \\
\Name{Muhammed Furkan Dasdelen\midlotherjointauthor\nametag{$^{1}$}}\orcid{0000-0003-2251-2093} \Email{furkan.dasdelen@helmholtz-munich.de}\\
\Name{Rao Muhammad Umer\nametag{$^{1}$}}\orcid{0000-0001-6179-5829} \Email{Umer.Rao@helmholtz-munich.de}\\
\Name{Carsten Marr\nametag{$^{1,2,3,4}$}}\orcid{0000-0003-2154-4552} \Email{carsten.marr@helmholtz-munich.de}\\
\addr $^{2}$ Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany \\
\addr $^{3}$ Munich Center for Machine Learning (MCML), Munich, Germany \\ \addr $^{4}$ DKTK, German Cancer Consortium, Munich, Germany
}


\begin{document}

\maketitle


\begin{abstract}
Multiple instance learning (MIL) is the standard for learning slide-level representations from whole slide images (WSIs), typically using a single attention-based aggregator to pool instance features. However, a single aggregator can struggle to capture morphological and compositional patterns of cells in pathology and cytology data, and different diseases may demand different pooling behaviours. We propose a mixture-of-aggregators framework that models complementary aspects of instance distributions in histology and hematologic cytology. A router with top-2 gating dynamically selects the most relevant aggregators per slide, and their outputs are fused into a patient-level representation. To avoid collapse to a single dominant expert aggregator, we add a load-balancing loss and Gumbel noise on the router logits. We evaluate our method on 19 different tasks from 16 datasets including histology and hematologic cytology. Compared to single-aggregator baselines, our approach improves diagnostic prediction accuracy by an average of 4.5\% over ABMIL and 12.6\% over TransMIL across all tasks. Beyond performance, our analysis shows that different aggregators attend to distinct, disease-specific instance distributions, providing interpretable insights into the diagnostic process.
% Multiple instance learning (MIL) has become the standard approach for learning meaningful representations from whole slide images (WSIs) in computational pathology. A variety of aggregation strategies have been proposed to compress instance-level features into slide-level representations, with attention mechanisms often employed to capture diagnostically relevant instances. However, single aggregators struggle to capture the diverse morphological patterns present in heterogeneous pathology and cytology datasets, where different diseases may require distinct aggregation strategies to identify relevant instances. In this work, we introduce a framework that employs a mixture of aggregators to model complementary aspects of instance distributions in histology and cytology images. Through a router with top-2 gating, our architecture dynamically selects the most relevant aggregators for each slide, whose outputs are weighted and later fused into a patient-level representation for classification. To prevent collapse into a single dominant aggregator, we introduce a balancing loss and Gumbel noise on router logits that encourages effective utilization of multiple aggregators during training.


%Thus, our method improves both the performance and interpretability of MIL models used in clinical pathology workflows, influencing diagnostic decision-making.
\end{abstract}

\begin{keywords}
computational pathology, multiple instance learning, cytology
\end{keywords}




\section{Introduction}

\label{sec:intro}
Pathology whole-slide images (WSIs) are indispensable for cancer diagnosis, but their manual assessment is time-consuming and highly dependent on expert experience. With the advent of high-throughput slide digitization, AI-based approaches have been introduced to support tumor detection, grading, morphological and molecular subtyping, and even survival prediction~\cite{chen2024uni,lu2023visuallanguagefoundationmodelcomputational,bilal2021weakly,vorontsov2024foundation,Li2023_histopath_tissueArea_CRCsurvival}. These tools can reduce the workload of pathologists while enabling faster and more standardized diagnostic outputs~\cite{BULTEN2021660,Dy2024,steiner2018impact, janowczyk2019histoqc}. WSIs are extremely large, often reaching gigapixel resolution, yet typically come with only slide-level labels. To bridge this gap, multiple instance learning (MIL) has become the standard paradigm. In MIL, a WSI is partitioned into non-overlapping patches, each patch is encoded into a feature representation, and an aggregator combines these patch-level features into a slide-level embedding for downstream tasks~\cite{li2021dsmil,lu2021clam,campanella2019clinical,ilse2018attention,shao2021transmil,ding2024multimodalslidefoundationmodel}.


The success of MIL depends critically on both the quality of instance encoding and the design of the aggregator. Modern pathology foundation models provide strong instance-level features~\cite{filiot2024phikon,hoptimus1,zimmermann2024virchow2,chen2024uni,lu2024visual}, but aggregation remains challenging~\cite{chen2024uni,ding2024multimodalslidefoundationmodel}. Most current MIL methods focus on identifying a small subset of diagnostically relevant patches, which is effective when the presence of a single pattern is sufficient for diagnosis. However, many diseases are defined not only by the presence of specific cell types or structures, but also by their distribution and relative frequency within the slide. %For example, two conditions may contain similar cell types but differ in the proportions in which they appear. 
Conventional single-aggregator approaches often fail to capture these subtler distributional patterns, leading to a loss of diagnostically important information~\cite{lu2021clam,li2021dsmil,shao2021transmil}.


Mixture-of-experts strategies, widely adopted in large language models, show that dividing responsibility across multiple specialized components allows the system to model diverse tasks and distributions more effectively~\cite{jacobs1991moe,shazeer2017moe,riquelme2021scalingvisionsparsemixture}. Inspired by this idea, we propose a mixture of aggregators for computational pathology. Within a single pipeline, multiple aggregators can learn complementary aspects of slide composition—some focusing on highly discriminative instances, others capturing broader distributional signals. We hypothesize that such diversity enables the model to represent distinct disease-specific distributions more faithfully, leading to improved diagnostic performance and better alignment with clinical reasoning. The main contributions of our work are: (i) Instead of a single aggregator, we train multiple aggregators in a MIL pipeline using a routing strategy that weights each aggregator’s contribution. (ii) Our pipeline supports diverse aggregator architectures and improves performance. (iii) We show that each aggregator captures distinct, diagnostically relevant, and complementary instance distributions.


\section{Related work}
\label{sec:relatedw}
\subsection{Aggregators in multiple instance learning}
Early multiple instance learning (MIL) approaches for whole-slide images (WSIs) employed non-parametric, permutation-invariant pooling functions---such as mean, max, and log-sum-exp (LSE)---to compress instance features into slide-level representations \cite{campanella2019clinical,inbook_review,keshvarikhojasteh2025quantitativeevaluationmultipleinstance}. While simple and efficient, these fixed functions have limited capacity to adapt to data. 

A major advance beyond static pooling mechanisms was the introduction of Attention-based Multiple Instance Learning (ABMIL)~\cite{ilse2018attention,sadafi2020attention}. ABMIL learns instance-specific attention weights and computes a weighted average of patch embeddings, enabling more flexible slide-level predictions and interpretable instance-level heatmaps. Building on ABMIL, several extensions have been proposed. Clustering-constrained Attention MIL (CLAM) incorporates instance-level clustering to promote diverse class-specific prototypes~\cite{lu2021clam}, while Dual-stream MIL (DSMIL) couples an instance-discriminative stream with a bag-level stream through contrastive alignment~\cite{li2021dsmil}. More recently, Transformer-based aggregators such as TransMIL apply self-attention across patches, explicitly modeling inter-instance relationships and often achieving improved whole-slide accuracy~\cite{shao2021transmil}.

Despite their architectural differences, these MIL approaches share a common bottleneck in how they form the final slide-level representation. In most implementations, including ABMIL, DSMIL, and Transformer-based MIL models, the bag is ultimately collapsed by a single pooling operation. This is typically realized either through attention pooling or through a classification token (CLS) whose final hidden state $h_{\mathrm{cls}}^{(L)}$ summarizes the entire set. Such readouts are mathematically equivalent to Pooling by Multi-Head Attention (PMA) with $k=1$ in Set Transformers~\cite{lee2019set}, and fall under the Deep Sets formulation~\cite{zaheer2017deepsets}. In essence, the model relies on a single learned query vector that attends over all patches, producing a learned weighted first-order moment of the instance distribution.

While this mechanism effectively captures average signal, it cannot directly model higher-order statistics—such as co-occurrence structures, multimodal feature distributions, or rare-pattern enrichment—which are central to histological heterogeneity and diagnostic accuracy. As a consequence, the architecture implicitly assumes that a slide can be summarized by a single global prototype. This assumption routinely breaks down in heterogeneous whole-slide images, where multiple competing morphologies or subclonal populations may coexist~\cite{zaheer2017deepsets,lee2019set,dosovitskiy2021vit,shao2021transmil}.

CLAM partially addresses this issue by using multi-head attention designed to produce class-specific attention maps, but all heads still share a common backbone, limiting their representational diversity. Transformer-based MIL methods can in principle capture richer distributions via self-attention, yet the quadratic complexity of standard attention becomes prohibitive for thousands of patches. TransMIL mitigates this with a Nystr\"om approximation of self-attention and a convolutional position encoding module that injects spatial context between layers, reducing effective complexity while retaining contextual information. However, its position-dependent design introduces permutation variance and reduces interpretability, limitations that are problematic for inherently permutation-invariant domains such as cytology.

These observations collectively suggest that effective slide-level analysis requires multiple specialized aggregation mechanisms that can adapt to different morphological patterns within a slide. This naturally points toward architectures that dynamically select or combine diverse processors rather than relying on a single monolithic pooling mechanism.

% A major advance beyond such static pooling was the introduction of Attention-based Multiple Instance Learning (ABMIL)~\cite{ilse2018attention,sadafi2020attention}. ABMIL learns instance-specific weights and computes a weighted average of features, enabling both more flexible slide-level predictions and interpretable instance-level heatmaps. Building on ABMIL, several variants have been proposed, including Clustering-constrained Attention MIL (CLAM), which introduces instance-level clustering to promote diverse prototypes~\cite{lu2021clam}, and Dual-stream MIL (DSMIL), which couples an instance-discriminative stream with a bag-level stream using contrastive alignment~\cite{li2021dsmil}. More recently, Transformer-based aggregators such as TransMIL extend this line of work by applying self-attention across instances to capture inter-instance correlations before pooling, often leading to improved WSI accuracy~\cite{shao2021transmil}.  

% Despite these advances, the final readout in most implementations remains a permutation-invariant pooling, realized either through attention pooling or through a classification token (CLS) whose final state $h_{\mathrm{cls}}^{(L)}$ summarizes the set. Such mechanisms are mathematically equivalent to Pooling by Multi-Head Attention (PMA) with $k=1$ in set transformers~\cite{lee2019set}, and fall under the deep sets formulation~\cite{zaheer2017deepsets}. Consequently, these readouts primarily act as learned first-moment projections. While effective for capturing average signal, they fail to directly model higher-order statistics such as co-occurrence patterns, which are often critical for histological heterogeneity and diagnostic accuracy. In practice, a single, monolithic aggregator implicitly assumes that the bag of instances can be summarized by one representative prototype, an assumption that frequently breaks down in heterogeneous whole-slide images~\cite{zaheer2017deepsets,lee2019set,dosovitskiy2021vit,shao2021transmil}. 
 
 % These limitations suggest that effective slide-level analysis requires multiple aggregation strategies that can specialize on different morphological patterns within a slide. This naturally points toward architectures that can dynamically select or combine multiple specialized processors.

 \begin{figure*}[t]
  \includegraphics[width=\textwidth]{figures/Fig1_v2.pdf}
  \caption{(A) Whole-slide images are patched and encoded into instance-level embeddings. A router assigns weights and selects the top-2 aggregators, which process the embeddings in parallel. Their weighted outputs are fused into a patient-level latent representation and passed to a classifier for disease prediction. (B) Each aggregator learns to focus on distinct morphological structures within the slide. %Their complementary attention patterns are combined through a weighted mixture, producing a more robust and informative patient-level decision.%Architecture of MoA. (A) Patient whole slide images (pathology or cytology) are patchified and encoded into instance-level embeddings. A router assigns weights and selects the top-2 aggregators, while all aggregators process the embeddings in parallel. The weighted outputs of the selected aggregators are fused into a patient-level latent representation, which is then passed to a classifier for disease prediction. (B) Each aggregator in MoA learns to attend to distinct morphological structures within whole-slide images. They focus on different diagnostically relevant regions of the tissue. Their outputs are combined via a weighted mixture to yield the final patient-level decision. This design leverages complementary attention distributions for more robust predictions. %In this example of patient (IMP-Cervix dataset), one aggregator shows strong attention to the epithelial-stromal interface, while another highlights squamous epithelial features.
  }
  \label{fig:fig1}
\end{figure*}

\subsection{Mixture-of-Experts for specialized modeling}
\label{subsec:moe}
 
The Mixture-of-Experts (MoE) framework is a well-established paradigm for scaling model capacity efficiently by employing a set of specialized sub-networks (``experts'') and a learned router that allocates inputs to the most relevant experts. Each expert has its own weight space, enabling specialization without forcing the model to optimize a single compromise solution across partially conflicting objectives~\cite{jacobs1991moe,shazeer2017moe}. While MoE has been widely adopted in natural language processing, its application in computational pathology has been limited, particularly at the critical aggregation stage.

Prior work has explored MoE in computational pathology primarily in multi-task learning settings, where experts are used to share representations across related tasks (e.g., mutation prediction) via task-specific multi-gated routing \cite{LI2025103561}. In this study, we introduce a slide-level MoE approach that combines multiple aggregators under a single router, which we term Mixture of Aggregators (MoA). In contrast to \cite{LI2025103561}, our work focuses on single-task diagnosis and aims to capture slide-level heterogeneity.

Instead of a single aggregator, several permutation-invariant aggregators are trained in parallel, each free to specialize in distinct morphological or domain regimes (for example, immune-rich versus tumor-dominant patterns, rare event sensitivity, or co-occurrence structures). A router produces data-dependent weights (often sparse top-$k$) that combine the aggregator summaries for each slide. This preserves a shared feature backbone while adding specialized capacity exactly where heterogeneity is highest—during the instance-to-slide aggregation. In effect, combining multiple invariant summaries via routing enables the model to approximate a richer family of set functions than any single aggregator, while keeping the interface simple (a slide-level vector) and the training recipe close to standard MIL practice \cite{zaheer2017deepsets,lee2019set,shazeer2017moe,lepikhin2020gshard}.


\section{Methodology}
\label{sec:methods}
    \subsection{MoA: Mixture of aggregators in multiple instance learning}
     Our framework employs multiple aggregators to capture distinct distributions of instances across different disease types. In our mixture-of-aggregators experiments, we use two commonly adopted aggregator architectures: Attention-based MIL (ABMIL) and a Transformer-based aggregator. Although many aggregator variants exist and one may appear superior to another in theory, recent studies show that their performance is often data-specific and strongly influenced by hyperparameter tuning~\cite{shao2025multipleinstancelearningmodels}. As additional single-aggregator baselines for comparison, we include CLAM, DSMIL and mean pooling (MeanMIL).

     For ABMIL, we use the gated attention pooling mechanism of Ilse et al.~\cite{ilse2018attention}, which learns instance-specific attention weights based on nonlinear gating. The bag representation is then obtained as a weighted sum of the instance embeddings.
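     For reference, gated attention pooling can be sketched in the notation of Ilse et al.~\cite{ilse2018attention}; the symbols $\mathbf{h}_n$, $\mathbf{V}$, $\mathbf{U}$ and $\mathbf{w}$ are introduced here only for illustration. For a bag of $N$ instance embeddings $\mathbf{h}_1,\dots,\mathbf{h}_N$, the attention weights $a_n$ and the bag representation $\mathbf{z}$ are
     \begin{equation*}
         a_n = \frac{\exp\!\left\{\mathbf{w}^{\top}\!\left(\tanh(\mathbf{V}\mathbf{h}_n)\odot\sigma(\mathbf{U}\mathbf{h}_n)\right)\right\}}{\sum_{j=1}^{N}\exp\!\left\{\mathbf{w}^{\top}\!\left(\tanh(\mathbf{V}\mathbf{h}_j)\odot\sigma(\mathbf{U}\mathbf{h}_j)\right)\right\}},
         \qquad
         \mathbf{z} = \sum_{n=1}^{N} a_n\,\mathbf{h}_n ,
     \end{equation*}
     where $\sigma$ is the logistic sigmoid, $\odot$ denotes the element-wise product, and $\mathbf{V}$, $\mathbf{U}$ and $\mathbf{w}$ are learnable parameters. The weights $a_n$ are non-negative and sum to one over the bag.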
        
    As Transformer-based aggregators, we use the permutation-invariant Transformer proposed by Wagner et al.~\cite{wagner2023transformer} for cytology tasks. Each instance embedding is projected into a 512-dimensional latent space, and a learnable $[\mathrm{CLS}]$ token is prepended to the sequence. We use two Transformer layers, each consisting of a multi-head self-attention block (with 8 heads) followed by a feedforward network with hidden dimension 1024. For pathology tasks, we use TransMIL~\cite{shao2021transmil} with the same dimensional parameters. The final $[\mathrm{CLS}]$ token is then used as the aggregated representation.
    

\subsection{Architecture}

    We first use an encoder to extract instance-level features (\autoref{fig:fig1}). For pathology images, we use UNI~\cite{chen2024uni}, and for hematologic cytology we use DinoBloom-B~\cite{koch2024dinobloom}. The extracted features of size $N \times D$ ($N$ instances, each with a $D$-dimensional embedding) are passed through a router to compute aggregator weights. The router consists of a projection layer, mean pooling over the instance features, and a linear layer. A top-2 softmax gating then selects the most relevant aggregators.
    
    In parallel, the features are passed through all aggregators (\autoref{fig:fig1}). The representations from the top-2 selected aggregators are weighted and summed to produce the patient-level latent representation, which is subsequently used for classification.
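    Under the description above, the routing and fusion steps can be sketched as follows. The symbols $\phi$ (projection layer), $W_r$ and $b_r$ (linear layer), $g$ (router logits), $a_i$ (the $i$-th aggregator) and $w_i$ (fusion weights) are notation introduced here for illustration. The renormalization of the two selected probabilities is an assumption, since the exact form of the fusion weights is not specified in the text. For a bag $x = \{x_1,\dots,x_N\}$ and $A$ aggregators,
    \begin{align*}
        g(x) &= W_r\Big(\tfrac{1}{N}\sum_{n=1}^{N}\phi(x_n)\Big) + b_r, &
        p(x) &= \mathrm{softmax}\big(g(x)\big), \\
        \mathcal{S}(x) &= \mathrm{TopK}\big(p(x),2\big), &
        z(x) &= \sum_{i\in\mathcal{S}(x)} w_i(x)\,a_i(x), \quad
        w_i(x) = \frac{p_i(x)}{\sum_{j\in\mathcal{S}(x)} p_j(x)} .
    \end{align*}
    The fused vector $z(x)$ is the patient-level latent representation that is passed to the classifier.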
    
    % For optimization, we use the standard cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$) for the slide-level classification task. To prevent the router from collapsing onto a small subset of aggregators, we additionally include a load-balancing auxiliary loss on the gating network. This loss encourages a more uniform allocation of patients across aggregators and follows the formulation introduced in \cite{fedus2021switch}. Let $A$ denote the number of aggregators and $\mathcal{B}$ the effective batch containing $T$ patients routed by the gating network. During training we use top-2 routing, i.e., each bag  is dispatched to the two aggregators with the highest router probabilities. For each aggregator $i \in \{1,\dots,A\}$ we define

    % \begin{align}
    % f_i &= \frac{1}{T}\sum_{x\in\mathcal{B}} \mathbf{1}\!\left[i \in \mathrm{TopK}(p(x),2)\right],\quad
    % P_i = \frac{1}{T}\sum_{x\in\mathcal{B}} p_i(x)
    % \end{align}
    
    % where $p(x) \in \mathbb{R}^A$ is the router’s softmax probability vector for patient $x$, $p_i(x)$ its $i$-th component, and $\mathrm{TopK}(p(x), 2)$ returns the indices of the two aggregators with largest probability for that patient. The quantity $f_i$ is thus the fraction of tokens that are actually routed to aggregator $i$ (hard assignment), while $P_i$ is the average router probability mass assigned to aggregator $i$(soft assignment). The load-balancing loss is then defined as
    % \begin{equation}
    %     \mathcal{L}_{\mathrm{LB}}
    %     = A \sum_{i=1}^{A} f_i \, P_i,\quad
    %      \mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{lb}} \cdot \mathcal{L}_{\mathrm{LB}}
    % \end{equation}
    % The overall training objective combines the classification loss and the load-balancing term with coefficient $\lambda_{\mathrm{lb}}$.

    For optimization, we use standard cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$) for slide-level classification and add a load-balancing auxiliary loss on the gating network to avoid routing collapse. This loss, following \cite{fedus2021switch}, encourages all aggregators to be used more evenly. Let $A$ be the number of aggregators and $\mathcal{B}$ the effective batch of $T$ patient-level bags. During training, we use top-2 routing, i.e., each bag is sent to the two aggregators with the highest router probabilities. For each aggregator $i \in \{1,\dots,A\}$ we define
    \begin{align}
        f_i &= \frac{1}{T}\sum_{x\in\mathcal{B}} \mathbf{1}\!\left[i \in \mathrm{TopK}(p(x),2)\right],\quad
        P_i = \frac{1}{T}\sum_{x\in\mathcal{B}} p_i(x)
    \end{align}
    where $p(x) \in \mathbb{R}^A$ is the router’s softmax probability vector for patient $x$, $p_i(x)$ its $i$-th component, and $\mathrm{TopK}(p(x), 2)$ returns the indices of the two aggregators with highest probability. Thus, $f_i$ is the fraction of bags actually routed to aggregator $i$ (hard usage), while $P_i$ is the average router probability mass assigned to aggregator $i$ (soft usage). The load-balancing loss is
    \begin{equation}
        \mathcal{L}_{\mathrm{LB}}
        = A \sum_{i=1}^{A} f_i \, P_i,\quad
        \mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{lb}} \cdot \mathcal{L}_{\mathrm{LB}} \, ,
    \end{equation}
    where $\lambda_{\mathrm{lb}}$ controls the strength of load balancing.
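    As an illustrative calculation (the value $A=4$ and the routing statistics below are hypothetical and chosen only for illustration), note that top-2 routing implies $\sum_i f_i = 2$ and $\sum_i P_i = 1$. If routing is perfectly balanced, $f_i = 1/2$ and $P_i = 1/4$ for all $i$, giving
    \begin{equation*}
        \mathcal{L}_{\mathrm{LB}} = 4 \sum_{i=1}^{4} \tfrac{1}{2}\cdot\tfrac{1}{4} = 2 .
    \end{equation*}
    If instead the router collapses onto the first two aggregators, for example $f = (1,1,0,0)$ and $P = (0.6,0.4,0,0)$, then
    \begin{equation*}
        \mathcal{L}_{\mathrm{LB}} = 4\,(1\cdot 0.6 + 1\cdot 0.4) = 4 ,
    \end{equation*}
    so the collapsed configuration incurs twice the penalty of the balanced one.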
    
    In addition, to mitigate aggregator collapse and promote diverse aggregator utilization, we perturb the gating logits with independent Gumbel(0,1) noise during training \cite{shazeer2017moe}. %This stochasticity encourages exploration by allowing tokens to occasionally route to alternative aggregators, especially in the early training stages. 
    To gradually transition from exploration to stable specialization, we anneal the softmax temperature: higher temperatures produce smoother, more exploratory distributions across aggregators, while lower temperatures sharpen the distribution and favor more deterministic routing \cite{nie2022densetosparse}. 
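    One way to write the noisy, temperature-scaled gating is sketched below. The symbols $g_i(x)$ (router logit for aggregator $i$), $\epsilon_i$ (noise) and $\tau$ (temperature) are notation introduced here, and applying the temperature after adding the noise is an assumption. The annealing schedule for $\tau$ is not specified in this section.
    \begin{equation*}
        \epsilon_i = -\log(-\log u_i),\quad u_i \sim \mathrm{Uniform}(0,1),
        \qquad
        p_i(x) = \frac{\exp\big((g_i(x)+\epsilon_i)/\tau\big)}{\sum_{j=1}^{A}\exp\big((g_j(x)+\epsilon_j)/\tau\big)} .
    \end{equation*}
    As $\tau \to \infty$, $p(x)$ approaches the uniform distribution over aggregators, whereas as $\tau \to 0$ it concentrates on the aggregator with the largest perturbed logit.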
    %This strategy follows the noisy routing principle in mixture-of-expert models \cite{shazeer2017moe} and the annealing-based gating mechanism of Dense-to-Sparse Gate \cite{nie2022densetosparse}, adapted here for our visual aggregation setting in computational pathology. The impact of this strategy have been explored in the ablation studies. 
 



\section{Experiments}
\label{sec:experimental_results}

\subsection{Dataset and preprocessing}

    We test our model on two modalities (cytology and histology), 13 organs/regions and 19 tasks, including morphological and immune subtyping. All datasets are publicly available. More details are given in Appendix~\ref{dataset_detail}.
    
    \textbf{Cytology}: AML-Hehr~\cite{hehr2023explainable} and cAItomorph~\cite{dasdelen2026ai} are blood smear datasets comprising single-cell white blood cell images.
    %includes 189 patients from four acute myeloid leukemia (AML) genetic subtypes 
    %(PML–RARA fusion, NPM1-mutation, CBFB–MYH11 fusion and RUNX1–RUNX1T1 fusion) 
    %and healthy controls. 
    %Each patient has an average of $430 \pm 107$ single-cell white blood cell images. We hold out 43 patients as the test set. 
    %cAItomorph data~\cite{dasdelen2026ai} includes 2,043 patients spanning seven hematologic conditions 
    %—acute myeloid leukemia (AML), myelodysplastic syndromes (MDS), myeloproliferative neoplasms (MPN), MDS/MPN overlap syndromes, lymphoma, plasma cell neoplasms, and reactive changes—
    %along with a healthy cohort. 
    %The number of white blood cell images per patient ranges from 55 to 500, with an average of $488 \pm 55$ cells. A total of 409 patients are held out for testing. This dataset is particularly challenging due to substantial inter-class heterogeneity and intra-class overlap.
    
    \textbf{Pathology}: We include fourteen different pathology datasets in our evaluation. 
    These span multiple organ systems, including breast pathology (BCNB \cite{xu2021predicting}, BRACS \cite{brancati2022bracs}), 
    renal cancer (CPTAC-CCRCC), head and neck cancer (CPTAC-HNSC, HANCOCK \cite{dorrich2025multimodal}), 
    lung cancer (CPTAC-LSCC), pancreatic cancer (CPTAC-PDA), endometrial cancer (CPTAC-UCEC), 
    cervical cancer (IMP-Cervix \cite{oliveira2024imp}) and a multi-organ cohort (CPTAC-ALL)~\cite{ellis2013clinical,zhang2025standardizing}. Tasks include biomarker prediction (BCNB), immune class and tumor microenvironment prediction (CPTAC-CCRCC/HNSC/PDA/UCEC, HANCOCK), histological grading (CPTAC-LSCC, HANCOCK, IMP-Cervix) and tumor site prediction (CPTAC-ALL, HANCOCK). %Detailed dataset descriptions with number of cases are given in Appendix. 
        
    % The BRACS~\cite{brancati2022bracs} dataset includes 547 H\&E slides from 189 patients  diagnosed with benign, atypical or malignant conditions. CPTAC-ALL~\cite{ellis2013clinical,zhang2025standardizing} includes 2154 whole slide images from 1061 patients and includes ten different organs. The task is to distinguish origin of slide morphology. CPTAC-OV~\cite{CPTAC_OV_2020} has 160 slides from 51 patients. It includes three different immune classes. HANCOCK~\cite{dorrich2025multimodal} is a head and neck cancer dataset. We include 704 H\&E images from 696 patients to predict four different primary tumor site. IMP-Cervix~\cite{oliveira2024imp} dataset consists of 599 H\&E-stained cervical samples obtained from surgical specimens. The samples are categorized into four classes: non-neoplastic, low-grade intraepithelial lesions, high-grade intraepithelial lesions, and other diagnoses. PANDA~\cite{bulten2022artificial} challenge dataset consist of prostate cancer core needle biopsies. For this dataset, following prior publications~\cite{chen2024uni}, we exclude slides with noisy or erroneous labels and include 9555 slides. Each slide is matched with ISUP prostate cancer grading (six classes)

    For all datasets, we fix the test set according to the original publications or the benchmark~\cite{zhang2025standardizing,vaidya2025molecular} and report model performance on the test set. Within the training split, we perform 5-fold cross-validation. 
    %In each fold, 20\% of the training data is held out as a validation set using stratification by label and, if available, by institute, while the remaining 80\% is used for model training.

%\subsubsection{Preprocessing}
    %For cytology datasets, we use the DinoBloom-B hematology feature extractor~\cite{koch2024dinobloom} and cache the embeddings for subsequent patient-level aggregation. For pathology datasets, we apply the TRIDENT~\cite{zhang2025standardizing,vaidya2025molecular} pipeline to segment and patchify WSIs at 20$\times$ magnification into patches of size $256 \times 256$. These patches are then embedded using the UNI feature extractor~\cite{chen2024uni}.
    For cytology datasets, we use the DinoBloom-B hematology feature extractor~\cite{koch2024dinobloom}. For pathology datasets, we use the TRIDENT pipeline~\cite{zhang2025standardizing,vaidya2025molecular} to patchify WSIs at 20$\times$ magnification into patches of size $256 \times 256$. These patches are then embedded using the UNI feature extractor~\cite{chen2024uni}.


\subsection{Evaluation metrics}
    For multi-class tasks, we report balanced accuracy, while for binary tasks, we report the area under the ROC curve (AUROC).
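    For completeness, balanced accuracy is the average per-class recall,
    \begin{equation*}
        \mathrm{Balanced\ Accuracy} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c},
    \end{equation*}
    where $C$ is the number of classes, and $TP_c$ and $FN_c$ are the numbers of true positives and false negatives for class $c$. Because every class contributes equally, the metric is less sensitive to class imbalance than plain accuracy.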
    
    % For performance comparison with the baseline, we use balanced accuracy, defined as:  
    % \begin{equation}
    %     \mathrm{Balanced\ Accuracy} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c},
    % \end{equation}
    % where $C$ is the number of classes, $TP_c$ is the number of true positives, and $FN_c$ is the number of false negatives for class $c$. Balanced accuracy corresponds to the average recall across all classes and is particularly useful for datasets with class imbalance.  
    
    To analyze instance-level attentions, we extract the attention scores produced by each aggregator. For ABMIL, we directly use the learned attention weights assigned to individual instances. For Transformer-based aggregators, we apply the Attention Rollout method \cite{abnar2020quantifying}. 
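    A sketch of the rollout computation, following Abnar and Zuidema~\cite{abnar2020quantifying}, is given below. The symbols are introduced here for illustration, and the inclusion of the residual (identity) term is an assumption taken from that reference rather than a detail confirmed in this section. Let $\bar{A}^{(l)}$ denote the attention matrix of layer $l$ averaged over heads, with $L$ layers in total. Then
    \begin{equation*}
        \hat{A}^{(l)} = \tfrac{1}{2}\big(\bar{A}^{(l)} + I\big),
        \qquad
        R^{(1)} = \hat{A}^{(1)},
        \qquad
        R^{(l)} = \hat{A}^{(l)} R^{(l-1)} \quad (l = 2,\dots,L),
    \end{equation*}
    and the row of $R^{(L)}$ corresponding to the $[\mathrm{CLS}]$ token gives the instance-level attention scores.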
    %Specifically, we first compute the mean attention across heads within each layer. Next, we propagate attention scores from the last layer back to the first by recursively multiplying the attention matrices. Finally, we inspect the attention distribution associated with the [CLS] token.
    %which serves as the bag-level representation and indicates the relative importance of instances.
    
    We use the Jensen--Shannon divergence (JSD) to quantify differences between the attention distributions of the aggregators~\cite{lin2002divergence}.
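    For two attention distributions $P$ and $Q$ over the same $N$ instances, the JSD is defined as
    \begin{equation*}
        \mathrm{JSD}(P\,\|\,Q) = \tfrac{1}{2}\,\mathrm{KL}(P\,\|\,M) + \tfrac{1}{2}\,\mathrm{KL}(Q\,\|\,M),
        \qquad M = \tfrac{1}{2}(P+Q),
    \end{equation*}
    where $\mathrm{KL}(P\,\|\,M) = \sum_{n=1}^{N} P_n \log (P_n / M_n)$. The JSD is symmetric, equals zero if and only if $P = Q$, and is bounded above by $\log 2$ when the natural logarithm is used (the choice of logarithm base is an assumption here).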


\begin{table}[t]
\centering
\caption{Comparison of aggregators across datasets. 
%Pathology features are extracted using UNI, and cytology features are extracted using DinoBloom-B. 
$\Delta$ indicates the relative improvement (\%) of the mixture over the corresponding single aggregator. The reported metric is balanced accuracy for multi-class tasks and area under the ROC curve (AUROC) for binary tasks. \textbf{Bold} indicates the better model between the single aggregator and the mixture of aggregators for the same architecture. \underline{Underline} highlights the best model across all models.}
\label{tab:results}
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|ccc|ccc|ccc}
\toprule
Dataset (number of classes) & MoA-ABMIL & ABMIL & $\Delta$ & MoA-TransMIL & TransMIL & $\Delta$ & CLAM-SB & DSMIL & MeanMIL \\
\midrule
\midrule
AML-Hehr (5C) & 78.4$\pm$2.2 & \underline{\textbf{81.5$\pm$3.7}} & -3.8\% & \underline{\textbf{81.5$\pm$1.0}} & 78.6$\pm$2.2 & +3.7\% & 76.7$\pm$6.0 & 41.7$\pm$2.7 & 78.3$\pm$2.1 \\
cAItomorph (8C) & \textbf{52.8$\pm$1.8} & 50.9$\pm$1.9 & +3.7\% & \underline{\textbf{60.1$\pm$1.1}} & 59.4$\pm$1.9 & +1.2\% & 55.0$\pm$3.0 & 42.0$\pm$2.5 & \underline{60.1$\pm$1.4} \\
\midrule
% BC-Therapy/grade & \textbf{67.4$\pm$6.5} & 66.3$\pm$8.2 &  & \textbf{63.8$\pm$6.1} & 60.1$\pm$10.2 &  & \underline{72.1$\pm$3.1} & 57.0$\pm$5.8 & 66.4$\pm$12.7 \\
BCNB/ER (2C) & \textbf{91.3$\pm$0.2} & 91.2$\pm$0.4 & +0.1\% & \textbf{88.4$\pm$0.5} & 85.7$\pm$3.3 & +3.1\% & 90.8$\pm$0.7 & 90.9$\pm$0.5 & \underline{91.7$\pm$0.4} \\
BCNB/HER2 (2C) & \textbf{84.0$\pm$0.7} & 84.0$\pm$0.5 & 0.0\% & \textbf{81.2$\pm$2.2} & 69.8$\pm$3.5 & +16.4\% & 83.1$\pm$2.2 & 81.5$\pm$1.3 & \underline{84.1$\pm$1.1} \\
BCNB/PR (2C) & \textbf{88.3$\pm$0.4} & 88.2$\pm$0.6 & +0.1\% & \textbf{84.7$\pm$0.7} & 78.2$\pm$3.5 & +8.3\% & 87.1$\pm$1.2 & 86.4$\pm$0.4 & \underline{89.0$\pm$0.5} \\
BRACS (7C) & \underline{\textbf{34.6$\pm$1.9}} & 34.4$\pm$1.5 & +0.6\% & \textbf{29.3$\pm$1.6} & 26.5$\pm$1.5 & +10.6\% & 32.8$\pm$3.3 & 27.2$\pm$2.5 & 26.9$\pm$1.7 \\
CPTAC-ALL (10C) & \textbf{96.1$\pm$0.3} & 95.5$\pm$0.2 & +0.6\% & \textbf{96.3$\pm$0.6} & 95.5$\pm$1.0 & +0.8\% & 96.4$\pm$0.7 & 96.5$\pm$0.4 & \underline{96.8$\pm$0.6} \\
% CPTAC_BRCA/PIK3CA_mutation & \textbf{70.8$\pm$11.4} & 68.0$\pm$6.5 &  & \textbf{66.9$\pm$8.5} & 55.9$\pm$14.1 &  & \underline{75.1$\pm$4.9} & 60.6$\pm$15.1 & 60.8$\pm$17.7 \\
CPTAC-CCRCC (3C) & \textbf{45.4$\pm$7.7} & 43.5$\pm$4.3 & +4.4\% & \textbf{47.2$\pm$4.1} & 45.2$\pm$3.7 & +4.4\% & \underline{47.7$\pm$4.0} & 45.4$\pm$3.7 & 45.9$\pm$4.7 \\
% CPTAC_COAD/KRAS_mutation & \textbf{65.9$\pm$3.0} & 61.8$\pm$3.2 &  & \underline{\textbf{74.1$\pm$10.5}} & 61.8$\pm$5.1 &  & 58.6$\pm$6.7 & 66.8$\pm$9.8 & 73.0$\pm$3.9 \\
% CPTAC_COAD/PIK3CA_mutation & \textbf{75.4$\pm$2.2} & 73.2$\pm$5.7 &  & \underline{\textbf{82.8$\pm$8.4}} & 63.7$\pm$12.6 &  & 77.8$\pm$6.3 & 57.2$\pm$14.5 & 69.5$\pm$5.5 \\
% CPTAC_COAD/SETD1B_mutation & \textbf{75.2$\pm$4.8} & 74.4$\pm$6.8 &  & \textbf{85.6$\pm$6.4} & 51.9$\pm$15.4 &  & 76.7$\pm$11.0 & 59.6$\pm$9.4 & \underline{88.5$\pm$4.6} \\
% CPTAC_GBM/EGFR_mutation & \textbf{67.2$\pm$6.0} & 66.3$\pm$8.9 &  & \textbf{67.2$\pm$2.5} & 64.2$\pm$7.5 &  & 60.3$\pm$10.3 & 57.2$\pm$5.8 & \underline{70.5$\pm$5.9} \\
% CPTAC_GBM/TP53_mutation & \textbf{69.3$\pm$2.7} & 67.3$\pm$2.7 &  & \textbf{70.0$\pm$4.0} & 63.2$\pm$5.6 &  & 67.8$\pm$4.5 & 67.7$\pm$1.7 & \underline{71.0$\pm$2.3} \\
CPTAC-HNSC (3C) & \underline{\textbf{35.1$\pm$5.7}} & 33.0$\pm$4.6 & +6.4\% & \textbf{31.6$\pm$3.6} & 27.9$\pm$5.8 & +13.3\% & 35.1$\pm$3.9 & 30.2$\pm$3.0 & 34.5$\pm$5.6 \\
% CPTAC_HNSC/OS & \underline{\textbf{84.8$\pm$1.6}} & 83.5$\pm$1.2 &  & \textbf{80.6$\pm$1.8} & 76.7$\pm$4.1 &  & 82.8$\pm$3.3 & 83.9$\pm$3.2 & 82.9$\pm$1.3 \\
% CPTAC_LSCC/ARID1A_mutation & \underline{\textbf{76.9$\pm$5.9}} & 75.4$\pm$5.0 &  & \textbf{73.1$\pm$6.4} & 51.5$\pm$14.5 &  & 69.1$\pm$9.0 & 44.4$\pm$18.9 & 60.5$\pm$9.0 \\
CPTAC-LSCC (2C) & \underline{\textbf{69.8$\pm$2.2}} & 67.1$\pm$3.5 & +4.0\% & \textbf{63.7$\pm$5.2} & 60.0$\pm$9.4 & +6.2\% & 65.1$\pm$2.3 & 65.0$\pm$2.8 & 60.2$\pm$3.4 \\
% CPTAC_LUAD/KRAS_mutation & \underline{\textbf{81.5$\pm$3.4}} & 81.3$\pm$3.0 &  & \textbf{78.6$\pm$2.3} & 74.9$\pm$7.5 &  & 78.4$\pm$2.8 & 78.5$\pm$1.7 & 75.4$\pm$1.5 \\
CPTAC-PDA (3C) & \textbf{39.3$\pm$7.3} & 35.1$\pm$3.0 & +11.9\% & \underline{\textbf{41.3$\pm$4.1}} & 32.9$\pm$7.0 & +25.5\% & 40.0$\pm$6.1 & 36.2$\pm$3.1 & 40.1$\pm$2.6 \\
% CPTAC_PDA/OS & \textbf{76.2$\pm$4.0} & 73.0$\pm$7.0 &  & \underline{\textbf{84.1$\pm$3.8}} & 66.8$\pm$8.3 &  & 71.2$\pm$6.8 & 74.1$\pm$5.5 & 72.7$\pm$5.5 \\
CPTAC-UCEC (3C) & \textbf{43.2$\pm$2.8} & 36.3$\pm$5.6 & +19.0\% & \underline{\textbf{44.9$\pm$7.3}} & 29.7$\pm$7.9 & +51.2\% & 37.0$\pm$9.4 & 28.7$\pm$7.8 & 33.5$\pm$10.1 \\
HANCOCK/K-SCC grading (2C) & \underline{\textbf{73.8$\pm$1.3}} & 71.4$\pm$5.8 & +3.4\% & \textbf{73.6$\pm$2.9} & 60.8$\pm$6.6 & +21.1\% & 67.6$\pm$2.3 & 54.7$\pm$9.2 & 70.8$\pm$4.9 \\
HANCOCK/NK-SCC grading (2C) & \underline{\textbf{67.0$\pm$5.8}} & 62.0$\pm$10.8 & +8.1\% & \textbf{61.0$\pm$10.2} & 48.0$\pm$9.7 & +27.1\% & 62.5$\pm$8.2 & 46.0$\pm$15.5 & 53.0$\pm$12.2 \\
HANCOCK/perineural invasion (2C) & \underline{\textbf{79.8$\pm$1.1}} & 76.9$\pm$0.7 & +3.8\% & \textbf{75.5$\pm$3.2} & 63.9$\pm$5.6 & +18.1\% & 75.6$\pm$2.9 & 63.0$\pm$7.8 & 76.1$\pm$0.7 \\
HANCOCK/metastasis (2C) & \underline{\textbf{74.8$\pm$1.3}} & 71.4$\pm$1.7 & +4.8\% & \textbf{64.7$\pm$3.6} & 63.2$\pm$5.5 & +2.4\% & 67.1$\pm$6.5 & 62.6$\pm$6.8 & 73.3$\pm$3.6 \\
HANCOCK/tumor site (4C) & \underline{\textbf{74.1$\pm$3.2}} & 68.5$\pm$1.9 & +8.2\% & \textbf{71.8$\pm$3.2} & 66.7$\pm$1.0 & +7.6\% & 71.3$\pm$2.0 & 60.4$\pm$2.2 & 71.1$\pm$2.4 \\
HANCOCK/vascular invasion (2C) & \textbf{55.3$\pm$6.8} & 51.6$\pm$7.6 & +7.2\% & \underline{\textbf{66.8$\pm$3.8}} & 59.9$\pm$6.0 & +11.5\% & 62.1$\pm$6.4 & 52.8$\pm$8.0 & 55.3$\pm$9.9 \\
IMP-Cervix (3C) & \textbf{46.6$\pm$2.8} & 45.0$\pm$3.6 & +3.6\% & \textbf{57.0$\pm$1.9} & 52.9$\pm$4.5 & +7.7\% & \underline{61.0$\pm$4.6} & 47.0$\pm$6.7 & 48.9$\pm$1.9 \\
\midrule
\textbf{Average} & 
\textbf{64.7} & 
\textbf{62.5} & 
\textbf{+4.5\%} & 
\textbf{64.2} & 
\textbf{58.1} & 
\textbf{+12.6\%} &
\textbf{63.9} &
\textbf{55.7} &
\textbf{62.6} \\
\bottomrule
\end{tabular}%
}
\end{table}
    \subsection{Results}

    \subsubsection{MoA enhances diagnostic predictions}

    We evaluate MoA across two modalities (cytology and pathology) and 19 downstream tasks (\autoref{tab:results}). A single, fixed training recipe---selected via the AML-Hehr ablations---is applied consistently to all datasets.

    For ABMIL experts, MoA-ABMIL matches or exceeds the single-aggregator ABMIL baseline in almost all settings. Across 19 tasks, MoA-ABMIL underperforms the baseline only once (AML-Hehr, $-3.8\%$), while providing positive or neutral gains elsewhere. Improvements are marginal on high-performing pathology tasks (e.g., BCNB, CPTAC-ALL; $\Delta \approx 0.0{-}0.6\%$) and become more pronounced on harder cohorts with lower baselines. For example, MoA-ABMIL improves performance on CPTAC-HNSC ($+6.4\%$), CPTAC-LSCC ($+4.0\%$), CPTAC-PDA ($+11.9\%$), and CPTAC-UCEC ($+19.0\%$), and achieves the best overall performance on several HANCOCK grading and invasion tasks. On average, MoA-ABMIL yields a $+4.5\%$ relative improvement over the single-aggregator ABMIL baseline across all datasets.
    
    For mixtures of Transformer experts, the effect is even stronger. MoA-TransMIL outperforms the single TransMIL baseline on every dataset, with an average relative gain of $+12.6\%$ (\autoref{tab:results}). Improvements are modest on already saturated tasks (e.g., CPTAC-ALL: $+0.8\%$), but become substantial in more challenging settings: BRACS ($+10.6\%$), BCNB/HER2 ($+16.4\%$), BCNB/PR ($+8.3\%$), CPTAC-PDA ($+25.5\%$), and especially CP\-TAC-UCEC ($+51.2\%$). Similar trends are observed across the HANCOCK tasks, where MoA-TransMIL delivers large gains for K-SCC grading ($+21.1\%$), NK-SCC grading ($+27.1\%$), perineural invasion ($+18.1\%$), and vascular invasion ($+11.5\%$). In cytology, MoA-TransMIL also improves over the baseline for AML-Hehr ($+3.7\%$) and cAItomorph ($+1.2\%$), indicating that the benefits of mixture-of-aggregators transfer to both smear-based and tissue-based tasks.
    
    Compared to alternative MIL baselines, MoA is competitive or superior on most datasets. MeanMIL, although the simplest aggregator, attains the best performance on 4/19 tasks (the three BCNB tasks and CPTAC-ALL), while CLAM-SB performs best on CPTAC-CCRCC and IMP-Cervix. Nevertheless, either MoA-ABMIL or MoA-TransMIL is the best or tied-best model on the majority of tasks (13/19). These results indicate that mixing aggregators, regardless of the choice of backbone, yields a robust improvement over both single-aggregator variants and established MIL baselines.

    The mixture of ABMIL and TransMIL aggregators within the MoA framework ($2\times$ ABMIL $+$ $2\times$ TransMIL) does not provide additional benefit and often performs between the pure ABMIL MoA and pure TransMIL MoA (Appendix \autoref{tab:mixed_moa_results_mean_std}).

    The additional benefits of MoA on cytology and histology tasks are shown in \autoref{fig:fig2}. Beyond overall performance, our method enhances class-level sensitivity and reduces confusion between malignant and non-malignant categories in the AML-Hehr dataset (\autoref{fig:fig2}A). Our method also consistently improves the performance of all classes in the HANCOCK dataset (\autoref{fig:fig2}B). Importantly, the aggregator weight distributions reveal distinct specializations (\autoref{fig:fig2}C). In the primary tumor-site identification task, Aggregator~3 predominantly contributes to oropharynx cases, while Aggregator~4 is more active in other tumor regions. Jensen–Shannon divergence analysis shows that the aggregators attain different attention distributions over patches (\autoref{fig:fig2}C, D), with a mean JSD of $0.42 \pm 0.10$.
    
        % We test our model across two modalities (cytology and pathology) and multiple organs. A single, fixed training recipe---selected via the AML-Hehr ablations---is applied consistently to all datasets.

        % For ABMIL experts, MoA consistently improves upon or matches the baseline across all datasets (\autoref{tab:results}). Notable gains are seen in hematologic cytology dataset (AML-Hehr: $+3.0\%$; cAItomorph: $+4.3\%$), BRACS ($+6.2\%$), CPTAC-OV ($+9.8\%$) and IMP-Cervix ($+9.0\%$). Marginal improvements are observed on high-performing baselines (CPTAC-ALL: $+0.18\%$; PANDA: $+0.2\%$). On average, multiple ABMIL aggregators yields a $4.06\%$ improvement over the single ABMIL baselines.  

        % For mixture of Transformer experts, improvements are consistent in cytology (AML-Hehr: $+4.0\%$; cAItomorph: $+1.8\%$). and BRACS ($+11.7\%$) (\autoref{tab:results}). Other two pathology datasets show near baseline performance (HANCOCK: $-1.6\%$; PANDA: $-0.6\%$), yet there is an average improvement of $3.06\%$ across all datasets. 
        
        % Beyond overall performance, our method enhances class-level sensitivity and reduces confusion between malignant and non-malignant categories in the AML-Hehr dataset (\autoref{fig:fig2}A). Importantly, the aggregator weight distributions reveal distinct specializations: Aggregator 4 predominantly contributes to AML subtypes, while Aggregator 2 is more active in healthy donors (\autoref{fig:fig2}B). This specialization highlights that different aggregators learn to model biologically meaningful patient groups.

        % A similar trend is observed in BRACS, where MoA improves sensitivity and specificity across all disease categories compared to the single-aggregator baseline (\autoref{fig:fig2}C). Here, Aggregator 2 is consistently active across patients, Aggregator 8 focuses on benign and atypical cases, and Aggregator 3 specializes exclusively in malignant conditions (\autoref{fig:fig2}D). Together, these results demonstrate that MoA not only improves overall performance but also encourages aggregator-level specialization that aligns with disease-specific morphology.
        
        %Our approach improves class based sensitivity and decreases the confusion between malignant vs non-malignant cases in AML-Hehr dataset (\autoref{fig:fig3}A). Aggregator weight distribution also shows distinct aggregator use between AML patients and healthy donors. While Aggregator 4 dominontly contribute to the AML patients, Aggregator 2 is more active in healthy people (\autoref{fig:fig3}B). Similarly in BRACS, MoA improves sensitivity and specificity of all disease types compared to baseline with single aggregator (\autoref{fig:fig3}C). Aggregator 2 seems to be active almost in all patients, while Aggregator 8 in benign and atypical conditions and Aggregator 3 is only working for malignant patients
        

        % Overall, MoA delivers positive average gains across both expert families and modalities, with the strongest improvements observed on heterogeneous or lower-baseline datasets, and only small regressions confined to a minority of pathology cohorts. These trends are consistent with the ablation findings: stabilizing the router via load balancing and controlled stochasticity benefits ABMIL experts robustly and transfers positively to Transformer experts, supporting the view of MoA as a general aggregation strategy across cytology and pathology tasks.
        
    \begin{figure*}[t]
      \includegraphics[width=\textwidth]{figures/Fig2_v2.pdf}
      \caption{(A) Confusion matrices for the AML-Hehr dataset. (B) MoA improves class-specific F1 scores on the HANCOCK dataset. (C) Aggregator weight distributions across patients reveal class-specific specialization. (D) Attention distributions of the aggregators differ across patients, as quantified by the Jensen–Shannon divergence (JSD).}
      \label{fig:fig2}
    \end{figure*}


    \begin{figure*}[t]
      \centering
      \includegraphics[width=0.9\textwidth]{figures/Fig3_v2.pdf}
      \caption{Different aggregators capture complementary diagnostic patterns. For the AML-Hehr dataset, we show patient-level attention analyses for (A) AML with PML-RARA, (B) AML with NPM1, and (C) AML with CBFB-MYH11. Top panels display single-cell attention distributions for the top-2 contributing aggregators and their mixture weights (red: dominant, blue: secondary, purple: mixture). Bottom panels compare their attentions, illustrating how each focuses on distinct cell subsets. This complementarity enables the model to capture subtype-specific morphological patterns. Attention scores are reported in \(10^{-3}\) units.}
      \label{fig:fig3}
    \end{figure*}
    
\subsubsection{Aggregators capture distinct attention patterns}
    To further assess aggregator specialization and determine whether they capture different distributions within a bag, we conducted a patient-wise analysis and computed instance-level attentions (single-cell images) from each aggregator (\autoref{fig:fig3}). We focus on the AML-Hehr hematologic cytology dataset because single-cell contributions to disease are easier to assess and several AML subtypes exhibit pathognomonic morphologic findings. We present three representative patients from different AML subtypes. The upper panel shows the attention distributions generated by the aggregators along with their respective contribution weights.
    
    Patient YST carries a PML::RARA fusion, which defines acute promyelocytic leukemia (APL), a distinct subtype of myeloid leukemia that requires rapid diagnosis and treatment. The hallmark of APL is the presence of promyelocytes. In patient YST, the dominant aggregator (Aggregator 2) assigns high attention to several promyelocytes, while the second aggregator highlights additional promyelocytes overlooked by the first (\autoref{fig:fig3}A).
    
    In patient PAM, the aggregators complement each other, with both capturing myeloblasts, which are essential for AML diagnosis (\autoref{fig:fig3}B).
    
    Patient ZRJ harbors a CBFB::MYH11 fusion, a genetic abnormality associated with AML characterized by monocytic and granulocytic differentiation. In this case, the first aggregator (Aggregator 4) identifies monocytic cells as disease-specific instances (\autoref{fig:fig3}C). The second aggregator complements it by assigning high attention to myeloblasts and granulocytic cells at different maturation stages.

    For quantitative evaluation, we calculated the intersection over union (IoU) between the high-attention top-k fraction of cells selected by different aggregators (Appendix \autoref{fig:fig4}). The mean IoU across test samples is $0.16 \pm 0.17$ at $k=0.01$ and $0.21 \pm 0.10$ at $k=0.05$, showing that only a small fraction of highly attended cells are shared between aggregators.
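    Concretely, writing $S_a^{(k)}$ for the set containing the top-$k$ fraction of cells ranked by the attention of aggregator $a$ (notation introduced here for illustration), the overlap between two aggregators $a$ and $b$ is
    \begin{equation}
        \mathrm{IoU}_k(a, b) = \frac{\bigl|S_a^{(k)} \cap S_b^{(k)}\bigr|}{\bigl|S_a^{(k)} \cup S_b^{(k)}\bigr|}.
    \end{equation}
    As a purely hypothetical illustration, for a patient with $N = 400$ cells and $k = 0.05$, each set contains $20$ cells; if $6$ of them are shared, then $\mathrm{IoU} = 6 / (20 + 20 - 6) = 6/34 \approx 0.18$.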



\subsection{Ablation Study}
\label{sec:ablation_study}
    
    %We assess generalization across diverse organs and modalities by selecting a single, fixed training recipe via ablations on AML-Hehr. 
    %Because AML-Hehr provides instance-level labels, these ablations allow us to decouple bag-level prediction quality from aggregation behavior and systematically examine how the router architecture, load-balancing strength ($\lambda_{\mathrm{lb}}$), and Gumbel noise influence expert specialization, attention distributions, and downstream accuracy. 
    % The best configuration is then frozen and applied unchanged to all other datasets, eliminating test-set tuning and fairer evaluation of cross-dataset generalization. %Importantly, AML-Hehr is a cytology dataset and thus out-of-domain for most of our remaining benchmarks; choosing hyperparameters on this distinct domain yields a stricter and fairer evaluation of cross-dataset generalization.

    % \begin{table}[htbp]
    % \centering
    % \small
    % \caption{Selected router configurations on AML-Hehr. 
    % \emph{Base} = baseline (single aggregator, no router); \emph{MLP} = multi-layer perceptron. Second line shows number of aggregators used. Third line shows load-balancing loss coefficient $\lambda_{\mathrm{lb}}$; fourth line indicates whether Gumbel routing is used (True/False). \textbf{Bold} indicates the best-performing configuration.}
    % \label{tab:ablation_selected}
    % \begin{tabular}{r|ccccccc}
    % \toprule
    % \textit{Router architecture:} &
    % Base &
    % \textbf{Linear} & MLP & Linear & Linear & Linear & Linear \\
    % \textit{Number of aggregators:} &
    % -- &
    % \textbf{4} & 4 & 2 & 6 & 4 & 4 \\
    % \textit{$\lambda_{lb}$:} &
    % -- &
    % \textbf{0.01} & 0.01 & 0.01 & 0.01 & 0.10 & 0.01 \\
    % \textit{Gumbel noise:} &
    % -- &
    % \textbf{T} & T & T & T & T & F \\
    % \midrule
    % \textbf{Balanced Acc} &
    % 78.6 &
    % \textbf{81.5} & 76.1 & 76.8 & 79.3 & 80.0 & 78.1 \\
    % \textbf{\% vs Base} &
    % -- &
    % \textbf{+3.7} & -3.2 & -2.3 & +0.9 & +1.8 & -0.6 \\
    % \bottomrule
    % \end{tabular}
    % \end{table}
    
\begin{table}[htbp]
\centering
\small
\caption{Selected router configurations on AML-Hehr. \emph{Baseline} = single aggregator, no router; \emph{MLP} = multi-layer perceptron; \emph{\# aggregators} = number of aggregators; $\lambda_{\mathrm{lb}}$ = load-balancing loss coefficient. Use of Gumbel routing is indicated as True (T)/False (F). \textbf{Bold} indicates the best-performing configuration.}
\label{tab:ablation_selected}
\resizebox{\textwidth}{!}
{
\begin{tabular}{lcccccc}
\toprule
\textbf{Hyperparameter} & \textbf{Router arch} & \textbf{\# aggregators} & \textbf{Top-k} & $\boldsymbol{\lambda_{\mathrm{lb}}}$ & \textbf{Gumbel} & \textbf{bAcc ($\Delta$)} \\
\midrule
Baseline & - & - & - & - & - & $78.6$ \\
\midrule
Default & Linear & 4 & 2 & 0.01 & T & $\mathbf{81.5 (+3.7)}$ \\
\midrule
Router arch & MLP    & 4 & 2 & 0.01 & T & $76.1(-3.2)$ \\
\midrule
\# aggregators & Linear & 2 & 2 & 0.01 & T & $76.8(-2.3)$ \\
\# aggregators & Linear & 6 & 2 & 0.01 & T & $79.3(+0.9)$ \\
\midrule
$\lambda_{\mathrm{lb}}$ & Linear & 4 & 2 & 0.10 & T & $80.0(+1.8)$ \\
\midrule
Gumbel noise & Linear & 4 & 2 & 0.01 & F & $78.1(-0.6)$ \\
\midrule
Top-k & Linear & 4 & 1 & 0.01 & T & $78.3(-0.4)$ \\
Top-k & Linear & 4 & 3 & 0.01 & T & $79.5(+1.2)$ \\
Top-k & Linear & 4 & 4 & 0.01 & T & $81.1(+3.7)$ \\
\bottomrule
\end{tabular}
}
\end{table}



    We select a single, fixed MoA configuration via ablations on AML-Hehr (\autoref{tab:ablation_selected}) and apply it to the remaining datasets. Compared to the single-aggregator baseline (78.6 balanced accuracy), mixtures only help when routing is lightly regularized and stochastic. Performance is best with an intermediate number of aggregators and mild load balancing: a router with four aggregators, $\lambda_{\mathrm{lb}} = 0.01$, and Gumbel noise enabled achieves 81.5 balanced accuracy on AML-Hehr (+3.7\% vs.\ baseline). Stronger regularization ($\lambda_{\mathrm{lb}} = 0.10$), disabling Gumbel noise, or changing the number of aggregators (2 or 6) reduces performance relative to this default. As expected, top-1 routing achieves similar performance to the baseline (78.3 vs.\ 78.6), while top-2 routing yields the largest improvement over the baseline (81.5). Increasing $k$ beyond 2 does not lead to further gains. We therefore adopt the 4-aggregator configuration with top-2 routing, Gumbel noise, and $\lambda_{\mathrm{lb}} = 0.01$ as the fixed training recipe for all subsequent experiments, where it generalizes well across organs and modalities (full ablations in Appendix \autoref{tab:ablation}).
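    The relative changes reported in \autoref{tab:ablation_selected} appear to follow the usual definition of relative improvement over a baseline score $s_{\mathrm{base}}$ (symbols introduced here for illustration):
    \begin{equation}
        \Delta = 100 \times \frac{s_{\mathrm{MoA}} - s_{\mathrm{base}}}{s_{\mathrm{base}}}\ \%.
    \end{equation}
    For the default configuration, $\Delta = 100 \times (81.5 - 78.6)/78.6 \approx +3.7\%$; for the MLP router, $\Delta = 100 \times (76.1 - 78.6)/78.6 \approx -3.2\%$.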
    

    
\subsection{Limitations}
    We acknowledge several limitations of our study. First, as shown in prior work, no single MIL architecture uniformly outperforms all others; performance is strongly dataset dependent. In our experiments, mixtures of aggregators generally improve over their single-aggregator counterparts, but they do not always achieve the best performance compared to alternative MIL architectures. Second, we experimented only with ABMIL and TransMIL in our pipeline; future work can incorporate additional MIL backbones within the MoA framework and benchmark them against their corresponding single-aggregator baselines. Finally, although we used a fixed training recipe across all datasets, additional hyperparameter tuning may be necessary to realize the benefits of MoA in other settings.


    
    % Second, in a subset of datasets we observe a slight decrease in the performance of MoA when a Transformer is used as the aggregator. We attribute this to the relative homogeneity of these datasets, where a standard Transformer aggregator is already sufficient to capture the underlying distribution. In such cases, the benefits of employing multiple complementary aggregators are less pronounced. In addition,,  addressing these challenges, for instance, through more memory-efficient routing mechanisms or adaptive expert selection strategies, remains an important direction for future work.
     
 

\section{Conclusion}
\label{sec:conclusion}
    We propose Mixture of Aggregators (MoA), a framework that employs multiple aggregators for multiple instance learning to better capture diverse distributions in heterogeneous datasets. Our approach improves diagnostic performance in both pathology and hematologic cytology. Within this framework, aggregators specialize in different disease types, and each specialized aggregator provides distinct, clinically relevant attention distributions over the instances. These complementary attention patterns enhance diagnostic accuracy when combined. The code is available at \url{https://github.com/fatihOzlugedik/MixtureOfAggregators}.
    
\midlacknowledgments{CM received funding from the European Research Council under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement 866411 \& 101113551). We acknowledge support from the High-Tech Agenda Bayern.}

\bibliography{midl26_238}
\appendix

\newpage
\section{Supplementary methods}
    \subsection{Aggregator details}
    
    For ABMIL, we use the gated attention pooling mechanism introduced by Ilse et al.~\cite{ilse2018attention}. Given a bag of $n$ instance embeddings $\{h_i\}_{i=1}^n$ with $h_i \in \mathbb{R}^{d}$, the attention weight for each instance is computed as
    \begin{equation}
      a_i = 
      \frac{\exp\!\left\{\, w^\top \bigl[\tanh(V h_i) \odot \sigma(U h_i)\bigr] \,\right\}}
           {\sum_{j=1}^n \exp\!\left\{\, w^\top \bigl[\tanh(V h_j) \odot \sigma(U h_j)\bigr] \,\right\}},\label{Eq1}
    \end{equation}
    
    where $V, U \in \mathbb{R}^{l \times d}$ and $w \in \mathbb{R}^{l}$ are learnable parameters, $\tanh(\cdot)$ and $\sigma(\cdot)$ denote the hyperbolic tangent and sigmoid activations, and $\odot$ is the element-wise product. The bag-level representation is then obtained as a weighted sum of the instances:
    \begin{equation}
    z = \sum_{i=1}^n a_i h_i.
    \end{equation}

    For the Transformer architecture, we follow the recipe of \cite{wagner2023transformer}. Self-attention is defined as
    \begin{equation}
    SA(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,\label{Eq2}
    \end{equation}
    
    where queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{n \times d_k}$, and values $V \in \mathbb{R}^{n \times d_v}$ are obtained from input embeddings $x \in \mathbb{R}^{n \times d}$ via
    \begin{equation}
    Q = x W_Q, \quad K = x W_K, \quad V = x W_V,\label{Eq3}
    \end{equation}
    
    with learnable weights $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$.

    \subsection{Training details}
    For model training, we use a fixed learning rate of $5 \times 10^{-5}$ with the AdamW optimizer (weight decay = 0.01), perform a gradient update every 16 patients, and train for up to 150 epochs with early stopping based on the validation loss. We employ four aggregators and adopt a staged training schedule: during the first three epochs, all four aggregators are used equally, after which we gradually reduce the number of active aggregators to two. The goal of this warm-up phase is to ensure that all aggregators acquire a minimal, shared understanding of the task before the router begins to decide which aggregators are most relevant for each sample. Enforcing top-2 routing encourages the aggregators to specialize rather than collapsing into a simple ensemble.

    We ran the experiments on a single H100 80GB GPU.
    %The impact of different expert numbers are explored in ablation studies
    \subsection{Dataset details}
    \label{dataset_detail}
    \textbf{AML-Hehr} \cite{hehr2023explainable} includes 189 patients from four acute myeloid leukemia (AML) genetic subtypes (PML–RARA fusion, NPM1-mutation, CBFB–MYH11 fusion and RUNX1–RUNX1T1 fusion) and healthy controls. Each patient has an average of $430 \pm 107$ single-cell white blood cell images. We hold out 43 patients as the test set. 
    
    The \textbf{cAItomorph} dataset \cite{dasdelen2026ai} includes 2,043 patients spanning seven hematologic conditions (acute myeloid leukemia (AML), myelodysplastic syndromes (MDS), myeloproliferative neoplasms (MPN), MDS/MPN overlap syndromes, lymphoma, plasma cell neoplasms, and reactive changes) along with a healthy cohort. The number of white blood cell images per patient ranges from 55 to 500, with an average of $488 \pm 55$ cells. A total of 409 patients are held out for testing. This dataset is particularly challenging due to substantial intra-class heterogeneity and inter-class overlap.
    
    \textbf{BCNB} \cite{xu2021predicting} includes 1058 core needle biopsy slides from early breast cancer patients. It includes binary prediction tasks of ER, PR and HER2 status.
    
    \textbf{BRACS} \cite{brancati2022bracs} consists of 547 breast tissue biopsy slides from 189 patients. It includes 7 classes: normal, pathological benign, usual ductal hyperplasia, flat epithelial atypia, atypical ductal hyperplasia, ductal carcinoma in situ, and invasive carcinoma. We randomly sampled patches from the dataset to improve computational efficiency.
    
    \textbf{CPTAC-CCRCC} \cite{cptac_ccrcc_2018} has 103 slides with clear cell renal cell carcinoma. We predict the immune class of patients: low, medium, high.
    
    \textbf{CPTAC-HNSCC} \cite{cptac_hnscc_2018} has 107 slides with head and neck cancer. Immune class prediction is made (low, medium, high).
    
    \textbf{CPTAC-LSCC} \cite{cptac_lscc_2018} includes 104 slides with lung squamous cell carcinoma. The histologic grade of the patients is predicted (well differentiated vs.\ moderately differentiated).
    
    \textbf{CPTAC-UCEC} \cite{cptac_ucec_2019} has 94 slides with endometrial carcinoma. Immune class prediction is made (low, medium, high).
    
    \textbf{HANCOCK} \cite{dorrich2025multimodal} is a multimodal dataset of 763 head and neck cancer patients. We include six tasks from this dataset. Five of them are binary: keratinizing squamous cell carcinoma grading (n=383), non-keratinizing squamous cell carcinoma grading (n=74), lymphovascular invasion (n=697), perineural invasion (n=697), and primary vs.\ metastatic tumor (n=676). The sixth task, prediction of the primary tumor site, includes 696 cases from four locations (oral cavity, larynx, oropharynx, hypopharynx).
    
    \textbf{IMP-Cervix} \cite{oliveira2024imp} includes 5333 samples from cervical biopsy. Cervical cancer grade is predicted: non-neoplastic, low-grade, high-grade.

    For train-test splits, we follow the PathoBench benchmark \cite{zhang2025standardizing}. We hold out the test split and apply 5-fold cross-validation within the training set.
    
    
    \newpage
    \section{Supplementary findings}
    \label{supp_finding}


    \begin{figure}[!h]
    \centering
    \includegraphics[width=0.5\textwidth]{figures/Fig4_v2.pdf}
    \caption{In AML-Hehr, we quantify the overlap between the most-attended cells selected by different aggregators using the intersection over union (IoU) of their top-$k$ sets, computed for each test sample and then averaged across the test set (mean $\pm$ s.d.). We evaluate $k$ as a fraction of cells per patient ($k/N \in \{0.01, 0.05, 0.10, 0.20, 0.25, 0.50, 0.75, 1\}$). IoU increases with $k$, but remains low for small $k$ (e.g., $\mathrm{IoU}@0.01 = 0.16 \pm 0.17$, $\mathrm{IoU}@0.05 = 0.21 \pm 0.10$), indicating that aggregators prioritize largely distinct subsets of high-attention cells.}
    \label{fig:fig4}
    \end{figure}
    \clearpage

    \begin{figure}[!h]
    \centering
    \includegraphics[width=0.6\textwidth]{figures/Fig5_v2.pdf}
    \caption{Aggregator specialization over training epochs on CPTAC organ classification. We train MoA with four aggregators on the CPTAC organ classification task and visualize how the router’s per-sample weights evolve during optimization. (A) Heatmaps show router-assigned weights for each patient (rows; grouped by organ type, dashed separators) across aggregators (columns) at selected epochs (0, 1, 3, 5, 10, 20). Early in training (epochs 0–1), routing is unstable and can be dominated by a single aggregator due to random initialization. As training progresses (epochs 3–20), organ-specific routing patterns emerge, indicating specialization of different aggregators for different organ types and increased utilization of multiple aggregators. (B) Mean router weight per aggregator across all samples over epochs. After the initial transient phase, the weights stabilize, with aggregators 1 and 4 receiving higher average weight than the others, reflecting more frequent use across samples, while aggregators 2 and 3 are used less on average.}
    \label{fig:fig5}
    \end{figure}

    \begin{figure}
    \centering
    \includegraphics[width=0.8\textwidth]{figures/Fig6_v2.pdf}
    \caption{Aggregator corruption analysis on HANCOCK. To test whether the router can suppress underperforming aggregators, we intentionally created “corrupted” aggregators by freezing the weights of $k=1$ (left) and $k=2$ (right) aggregators during training, while keeping the remaining aggregators trainable. We visualize the router’s weight distribution over epochs. In both settings, the router rapidly discards the frozen aggregators and reallocates probability mass to the trainable ones, indicating that it can identify and effectively eliminate poor aggregators early in training.}
    \label{fig:fig6}
    \end{figure}


\begin{table}[hbp]
\centering
\small
\caption{Performance (mean $\pm$ std, in \%) of mixed-aggregator MoA (2$\times$ABMIL + 2$\times$TransMIL) compared with the single-architecture variants (ABMIL MoA, TransMIL MoA) and their single-aggregator baselines (ABMIL, TransMIL).}
\label{tab:mixed_moa_results_mean_std}
\setlength{\tabcolsep}{5pt}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccccc}
\toprule
\textbf{Dataset / Task} & \textbf{ABMIL} & \textbf{TransMIL} & \textbf{ABMIL MoA} & \textbf{TransMIL MoA} & \textbf{Mixed MoA} \\
\midrule
AML-Hehr  &
81.5 $\pm$ 3.7 &
78.6 $\pm$ 2.2 &
78.4 $\pm$ 2.2 &
81.5 $\pm$ 1.0 &
80.1 $\pm$ 3.2 \\

cAItomorph &
50.9 $\pm$ 1.9 &
59.4 $\pm$ 1.9 &
52.8 $\pm$ 1.8 &
60.1 $\pm$ 1.1 &
58.6 $\pm$ 1.6 \\

BCNB/ER &
91.2 $\pm$ 0.4 &
85.7 $\pm$ 3.3 &
91.3 $\pm$ 0.2 &
88.4 $\pm$ 0.5 &
90.6 $\pm$ 0.9 \\

BCNB/HER2 &
84.0 $\pm$ 0.5 &
69.8 $\pm$ 3.5 &
84.0 $\pm$ 0.7 &
81.2 $\pm$ 2.2 &
78.3 $\pm$ 2.3 \\

BCNB/PR &
88.2 $\pm$ 0.6 &
78.2 $\pm$ 3.5 &
88.3 $\pm$ 0.4 &
84.7 $\pm$ 0.7 &
87.0 $\pm$ 0.5 \\

BRACS  &
34.4 $\pm$ 1.5 &
26.5 $\pm$ 1.5 &
34.6 $\pm$ 1.9 &
29.3 $\pm$ 1.6 &
35.4 $\pm$ 2.4 \\

CPTAC-ALL  &
95.5 $\pm$ 0.2 &
95.5 $\pm$ 1.0 &
96.1 $\pm$ 0.3 &
96.3 $\pm$ 0.6 &
95.9 $\pm$ 0.5 \\

CPTAC-CCRCC &
43.5 $\pm$ 4.3 &
45.2 $\pm$ 3.7 &
45.4 $\pm$ 7.7 &
47.2 $\pm$ 4.1 &
48.0 $\pm$ 2.7 \\

CPTAC-HNSC &
33.0 $\pm$ 4.6 &
27.9 $\pm$ 5.8 &
35.1 $\pm$ 5.7 &
31.6 $\pm$ 3.6 &
35.1 $\pm$ 3.1 \\

CPTAC-LSCC &
67.1 $\pm$ 3.5 &
60.0 $\pm$ 9.4 &
69.8 $\pm$ 2.2 &
63.7 $\pm$ 5.2 &
63.5 $\pm$ 4.0 \\

CPTAC-PDA &
35.1 $\pm$ 3.0 &
32.9 $\pm$ 7.0 &
39.3 $\pm$ 7.3 &
41.3 $\pm$ 4.1 &
41.3 $\pm$ 2.3 \\

CPTAC-UCEC  &
36.3 $\pm$ 5.6 &
29.7 $\pm$ 7.9 &
43.2 $\pm$ 2.8 &
44.9 $\pm$ 7.3 &
42.1 $\pm$ 3.1 \\

HANCOCK/K-SCC grading &
71.4 $\pm$ 5.8 &
60.8 $\pm$ 6.6 &
73.8 $\pm$ 1.3 &
73.6 $\pm$ 2.9 &
73.0 $\pm$ 1.7 \\

HANCOCK/NK-SCC grading &
62.0 $\pm$ 10.8 &
48.0 $\pm$ 9.7 &
67.0 $\pm$ 5.8 &
61.0 $\pm$ 10.2 &
59.0 $\pm$ 10.2 \\

HANCOCK/perineural invasion &
76.9 $\pm$ 0.7 &
63.9 $\pm$ 5.6 &
79.8 $\pm$ 1.1 &
75.5 $\pm$ 3.2 &
76.7 $\pm$ 3.0 \\

HANCOCK/metastasis &
71.4 $\pm$ 1.7 &
63.2 $\pm$ 5.5 &
74.8 $\pm$ 1.3 &
64.7 $\pm$ 3.6 &
68.2 $\pm$ 2.4 \\

HANCOCK/tumor site &
68.5 $\pm$ 1.9 &
66.7 $\pm$ 1.0 &
74.1 $\pm$ 3.2 &
71.8 $\pm$ 3.2 &
72.2 $\pm$ 2.4 \\

HANCOCK/vascular invasion &
51.6 $\pm$ 7.6 &
59.9 $\pm$ 6.0 &
55.3 $\pm$ 6.8 &
66.8 $\pm$ 3.8 &
66.1 $\pm$ 5.1 \\

IMP-Cervix  &
45.0 $\pm$ 3.6 &
52.9 $\pm$ 4.5 &
46.6 $\pm$ 2.8 &
57.0 $\pm$ 1.9 &
53.4 $\pm$ 3.0 \\

\bottomrule
\end{tabular}
}
\end{table}


    
    
    
    \begin{table}[hbp]
    \centering
    \renewcommand{\arraystretch}{1.25}
    \caption{Full ablation of router configurations on AML-Hehr. \emph{Base} = baseline (single aggregator, no router); \emph{Lin} = linear router; \emph{MLP} = multi-layer perceptron router. The header rows give, from top to bottom, the router architecture, the number of experts, the load-balancing loss coefficient $\lambda_{\mathrm{lb}}$, and whether Gumbel routing is used (T = true, F = false). \emph{\% vs Base} is the relative change in balanced accuracy with respect to the baseline. \textbf{Bold} indicates the best-performing configuration.}
    \label{tab:ablation}
    \resizebox{\textwidth}{!}{
    \begin{tabular}{r|ccccccccccccccccccccccccc}
    \toprule
    \textit{router-arch:} &
    Base &
    MLP & Lin & Lin & MLP &
    MLP & Lin & Lin & MLP &
    MLP & Lin & Lin & MLP &
    Lin & MLP & MLP & Lin &
    Lin & MLP & MLP & \textbf{Lin} &
    Lin & MLP & MLP & Lin \\
    
    \textit{\#Experts:} &
    -- &
    2 & 2 & 2 & 2 &
    4 & 4 & 4 & 4 &
    6 & 6 & 6 & 6 &
    2 & 2 & 2 & 2 &
    4 & 4 & 4 & \textbf{4} &
    6 & 6 & 6 & 6 \\
    
    \textit{$\lambda_{\mathrm{lb}}$:} &
    -- &
    0.01 & 0.01 & 0.10 & 0.10 &
    0.01 & 0.01 & 0.10 & 0.10 &
    0.01 & 0.01 & 0.10 & 0.10 &
    0.10 & 0.01 & 0.10 & 0.01 &
    0.10 & 0.01 & 0.10 & \textbf{0.01} &
    0.10 & 0.01 & 0.10 & 0.01 \\
    
    \textit{Gumbel:} &
    -- &
    F & F & F & F &
    F & F & F & F &
    F & F & F & F &
    T & T & T & T &
    T & T & T & \textbf{T} &
    T & T & T & T \\
    \midrule
    
    \textbf{Balanced Acc} &
    78.6 &
    79.1 & 78.5 & 78.5 & 79.1 &
    74.9 & 78.1 & 75.8 & 78.5 &
    79.1 & 78.1 & 77.6 & 72.2 &
    76.8 & 79.6 & 77.8 & 76.8 &
    80.0 & 76.1 & 76.3 & \textbf{81.5} &
    74.1 & 78.0 & 78.2 & 79.3 \\
    \textbf{\% vs Base} &
    -- &
    $+0.6$ & $-0.1$ & $-0.1$ & $+0.6$ &
    $-4.7$ & $-0.6$ & $-3.6$ & $-0.1$ &
    $+0.6$ & $-0.6$ & $-1.3$ & $-8.1$ &
    $-2.3$ & $+1.3$ & $-1.0$ & $-2.3$ &
    $+1.8$ & $-3.2$ & $-2.9$ & $\mathbf{+3.7}$ &
    $-5.7$ & $-0.8$ & $-0.5$ & $+0.9$ \\
    \bottomrule
    \end{tabular}
    }
    \end{table}
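
    The \emph{\% vs Base} row of Table~\ref{tab:ablation} is consistent with a relative, not absolute, change in balanced accuracy with respect to the baseline value of $78.6$; this reading is inferred from the reported numbers. For the best configuration,
    \begin{equation*}
    \frac{81.5 - 78.6}{78.6} \times 100 \approx +3.7\,\%,
    \end{equation*}
    which corresponds to an absolute gain of $2.9$ percentage points. Likewise, the weakest configuration gives $(72.2 - 78.6)/78.6 \times 100 \approx -8.1\,\%$, an absolute drop of $6.4$ points.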
    
    
    \begin{table}
    \centering
    \caption{Inference times and parameter counts of different MIL architectures on the HANCOCK dataset. MoA increases per-sample inference time by roughly $2.4\times$ (TransMIL) to $2.9\times$ (ABMIL) compared to the corresponding single-aggregator model.}
    \label{tab:hancock_final_fold_time_moa}
    \begin{tabular}{lccc}
    \toprule
    Architecture   & Total time [s] & Time / sample [ms] & \# of parameters \\
    \midrule
    ABMIL          & 0.052 & 0.368 & 0.69 M \\
    MoA-ABMIL      & 0.148 & 1.052 & 3.15 M \\
    TransMIL       & 0.462 & 3.274 & 2.54 M \\
    MoA-TransMIL   & 1.100 & 7.805 & 10.56 M \\
    DSMIL          & 0.101 & 0.716 & 0.86 M \\
    CLAM-SB        & 0.198 & 1.401 & 0.79 M \\
    Mean           & 0.025 & 0.173 & 0.43 M \\
    \bottomrule
    \end{tabular}
    \end{table}
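
    The overhead of MoA can be read off directly from the per-sample times and parameter counts in Table~\ref{tab:hancock_final_fold_time_moa}. As a worked check of the slowdown factors:
    \begin{equation*}
    \frac{t_{\text{MoA-ABMIL}}}{t_{\text{ABMIL}}} = \frac{1.052}{0.368} \approx 2.86,
    \qquad
    \frac{t_{\text{MoA-TransMIL}}}{t_{\text{TransMIL}}} = \frac{7.805}{3.274} \approx 2.38 .
    \end{equation*}
    The corresponding parameter ratios are $3.15/0.69 \approx 4.57$ and $10.56/2.54 \approx 4.16$, so in this measurement inference time grows less than the parameter count.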

    \begin{table}[hbp]
    \centering
    \small
    \caption{Comparison of MoA with ensemble baselines on AML-Hehr. We compare a single-aggregator baseline, MoA with top-2 routing, and two ensemble baselines that use all four aggregators without routing: averaging the aggregator outputs (mean ensemble) and concatenating them (concat ensemble). Median inference time is reported per sample, along with the slowdown relative to the single-aggregator baseline. MoA achieves the best balanced accuracy ($81.5 \pm 1.0$) while incurring only a $2.05\times$ slowdown. The mean ensemble reaches similar accuracy ($81.3 \pm 1.2$) but at a $3.56\times$ slowdown.}
    \label{tab:ensemble_baseline}
    \setlength{\tabcolsep}{6pt}
    \resizebox{\textwidth}{!}{
    \begin{tabular}{lcccc}
    \toprule
    \textbf{Method} & \textbf{\# aggregators} & \textbf{Balanced} & \textbf{Median inference time} & \textbf{Relative slowdown} \\
     &  & \textbf{Acc.} & \textbf{(ms / sample)} & \textbf{vs. single} \\
    \midrule
    Baseline & 1 & $78.6 \pm 2.2$ & 1.783 & 1.0$\times$ \\
    MoA (top-2 routing) & 4 & $81.5 \pm 1.0$ & 3.663 & 2.05$\times$ \\
    Mean Ensemble & 4 & $81.3 \pm 1.2$ & 6.343 & 3.56$\times$ \\
    Concat Ensemble & 4 & $78.3 \pm 2.2$ & 6.330 & 3.55$\times$ \\
    \bottomrule
    \end{tabular}
    }
    \end{table}
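
    The relative slowdowns in Table~\ref{tab:ensemble_baseline} are the ratios of the median per-sample times to the baseline time of $1.783$\,ms:
    \begin{equation*}
    \frac{3.663}{1.783} \approx 2.05,
    \qquad
    \frac{6.343}{1.783} \approx 3.56,
    \qquad
    \frac{6.330}{1.783} \approx 3.55 .
    \end{equation*}
    Comparing MoA with the mean ensemble directly gives $3.663 / 6.343 \approx 0.58$, i.e., in this measurement top-2 routing needs about $58\,\%$ of the mean ensemble's inference time at comparable balanced accuracy.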
    
    \pagebreak
    \end{document}
