\section{Introduction}

Multiple Instance Learning (MIL) is a weakly supervised learning framework designed for scenarios in which data are organized into sets of instances, referred to as bags, while labels are provided only at the bag level~\cite{MIL,mil_maron1997framework}. MIL is relevant in many domains where fine-grained annotations are scarce or expensive, such as digital histopathology~\cite{article_wsi_mil_1, histo_mil_1, histo_mil_3, histo_mil_2}, where labels are typically provided at the level of Whole Slide Images (WSIs), which can reach extremely large resolutions (often as large as 100k pixels per side)~\cite{clam}.

Under the standard MIL assumption, a bag is labelled positive if at least one of its instances is positive, and negative otherwise. While this assumption is appropriate for certain detection tasks, it is often too restrictive for real-world biomedical applications. In practice, slide-level labels usually arise from complex interactions between multiple tissue regions rather than from a single discriminative patch. To address this limitation, numerous attention-based MIL approaches have been proposed, enabling models to learn instance-level importance directly from data~\cite{ilse2018attentionbaseddeepmultipleinstance, shao2021transmiltransformerbasedcorrelated, li2021dualstreammultipleinstancelearning}. These methods replace fixed aggregation strategies such as max or mean pooling with learnable attention mechanisms that assign each instance a relevance score, referred to as its attention weight.
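As an illustrative sketch (the notation here is introduced only for this example and may differ from the specific models studied later), consider a bag $X = \{x_1, \dots, x_K\}$ whose instances carry latent labels $y_k \in \{0, 1\}$. The standard assumption then corresponds to the bag label
\begin{equation}
Y = \max_{k = 1, \dots, K} y_k ,
\end{equation}
whereas attention-based pooling, in the form proposed in~\cite{ilse2018attentionbaseddeepmultipleinstance}, aggregates instance embeddings $h_k$ into a bag representation
\begin{equation}
z = \sum_{k=1}^{K} a_k h_k , \qquad
a_k = \frac{\exp\left( w^{\top} \tanh\left( V h_k \right) \right)}{\sum_{j=1}^{K} \exp\left( w^{\top} \tanh\left( V h_j \right) \right)} ,
\end{equation}
where $w$ and $V$ are learnable parameters. The weights $a_k$ are non-negative and sum to one over the bag, and they are the quantities typically visualized as attention.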

Despite the remarkable success of attention-based MIL in digital pathology, important questions remain regarding the reliability of attention as an interpretability tool. Attention weights are often visualized as heatmaps and interpreted as indicators of model reasoning \cite{lu2021ai, wagner2023transformer}. However, recent work on interpretability has shown that attention may not reliably reflect the true importance of instances for the final prediction \cite{hense2025xmilinsightfulexplanationsmultiple,zhang2022dtfdmildoubletierfeaturedistillation, early2024inherentlyinterpretabletimeseries, javed2022additivemilintrinsicallyinterpretable}. In some cases, models can achieve similar predictions while relying on substantially different attention distributions, revealing a potential disparity between attention and decision-making. This limitation is particularly critical in medical applications, where interpretability is not merely a convenience but a prerequisite for trust, clinical adoption, and regulatory approval.

In this work, we investigate the reliability of raw attention scores as a proxy for interpretability in MIL models. First, we study a broad family of attention-based MIL architectures within a unified experimental framework and report, both quantitatively and qualitatively, how design choices such as attention type, multi-head formulations, clustering strategies, and entropy regularization affect the influence of attention on predictions. Second, we explore the use of causal counterfactual interventions to guide the learning of attention toward representations that are more causally aligned with the model's predictions. By enforcing counterfactual consistency during training, our goal is to promote attention patterns that more accurately reflect the underlying causal mechanisms. This introduces an explicit trade-off between predictive performance and interpretability, which we characterize empirically.

The main contributions of this work can be summarized as follows. \textit{(i) Attention reliability analysis:} we perform an extensive evaluation of attention reliability across a wide range of MIL models, including standard attention MIL and its variants such as clustering-based MIL, multi-head attention MIL, and DSMIL. Our analysis reveals that the attention weights produced as a byproduct of current state-of-the-art MIL models are not fully causally aligned with the downstream prediction. \textit{(ii) Novel counterfactual attention intervention framework (\ours):} we propose a counterfactual-guided attention learning strategy designed to improve the causal alignment and stability of attention mechanisms in MIL models. \textit{(iii) Analysis of the interplay between downstream performance and attention interpretability:} we conduct extensive experiments on real-world digital pathology datasets, demonstrating the complex trade-offs between predictive performance and the interpretability of attention. Importantly, we do not claim to provide a method that increases downstream performance; rather, we aim to maintain the performance of current MIL methods while improving their explainability, which is crucial for the safe deployment of medical imaging techniques.