\section{Related Work}

\myparagraph{Attention-Based MIL}
Embedding-level MIL models operate directly in the instance embedding space to compute a compact bag-level representation that is subsequently passed to a classifier. In the standard attention-based MIL formulation \cite{ilse2018attentionbaseddeepmultipleinstance}, the bag representation is expressed as a weighted sum of instance embeddings, where the attention weights are learned functions of the instances themselves. Building upon this framework, numerous variants have been proposed to improve representational capacity and performance. TransMIL \cite{shao2021transmiltransformerbasedcorrelated} introduces self-attention \cite{wagner2023transformer, xiong2021nystromformernystrombasedalgorithmapproximating} to model correlations between instances, while CLAM \cite{clam} integrates clustering-based attention to capture multiple discriminative regions within a bag. DSMIL \cite{li2021dualstreammultipleinstancelearning} adopts a dual-stream architecture to explicitly model instance-level and bag-level interactions. Other extensions leverage multi-head attention and multi-branch aggregation strategies. More recent models aim to regularize attention, reducing over-reliance on a few highly activated instances. For instance, MHIM-MIL \cite{mhim-mil} adopts a Siamese framework with masked attention to mine hard-to-classify instances, while ACMIL \cite{zhang2024attentionchallengingmultipleinstancelearning} introduces multi-branch attention together with stochastic top-$K$ instance masking to promote diversity in discriminative patterns. To further improve learning under limited supervision, feature distillation and pseudo-bag generation strategies such as DTFD-MIL \cite{zhang2022dtfdmildoubletierfeaturedistillation} have also been proposed.
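As a concrete reference point, the attention pooling of \cite{ilse2018attentionbaseddeepmultipleinstance} can be sketched as follows; the symbols are introduced here purely for illustration. For a bag of $K$ instance embeddings $\mathbf{h}_1, \dots, \mathbf{h}_K$ (taken as column vectors), the bag representation $\mathbf{z}$ is
\[
\mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k,
\qquad
a_k = \frac{\exp\bigl\{\mathbf{w}^{\top} \tanh(\mathbf{V} \mathbf{h}_k)\bigr\}}{\sum_{j=1}^{K} \exp\bigl\{\mathbf{w}^{\top} \tanh(\mathbf{V} \mathbf{h}_j)\bigr\}},
\]
where $\mathbf{w}$ and $\mathbf{V}$ are learnable parameters. The weights $a_k$ are non-negative and sum to one, which is why they are commonly read as instance importance scores.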

\myparagraph{Interpretability of MIL Models}
In digital histopathology, attention-based MIL models typically rely on attention scores to generate patch-level relevance maps that highlight regions of interest. Despite their intuitive appeal, several studies have shown that raw attention maps do not necessarily provide faithful explanations of model behavior \cite{hense2025xmilinsightfulexplanationsmultiple, javed2022additivemilintrinsicallyinterpretable, zhang2022dtfdmildoubletierfeaturedistillation}. To address this limitation, alternative explainability strategies have been proposed. DTFD-MIL \cite{zhang2022dtfdmildoubletierfeaturedistillation} reframes MIL as an equivalent image classification problem and derives patch-level importance scores using Grad-CAM \cite{gradCam}. Other post-hoc explanation methods have also been explored \cite{adebayo2018sanity, AI_reliability, histo_interp_1, histo_interp_2, bach2015pixel, baehrens2010explain, shrikumar2017learning, montavon2019layer, hense2025xmilinsightfulexplanationsmultiple}, including perturbation-based approaches \cite{early2024inherentlyinterpretabletimeseries}, while fully additive models explicitly decompose the bag-level prediction into a sum of instance contributions \cite{javed2022additivemilintrinsicallyinterpretable}. Despite these advances, many existing interpretability methods suffer from high computational cost, limited scalability to large bags, or simplifying assumptions that neglect complex inter-instance dependencies. Developing MIL models that are simultaneously accurate, scalable, and faithfully interpretable remains an open challenge.
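To make the additive decomposition concrete, the following is a rough sketch in notation assumed here for illustration, not the exact formulation of \cite{javed2022additivemilintrinsicallyinterpretable}. Writing $\mathbf{h}_k$ for the embedding of instance $k$, $a_k$ for its attention weight, and $g$ for the classifier that outputs class logits, a standard attention-based model applies $g$ after pooling, whereas an additive model applies it before summation:
\[
f_{\mathrm{std}} = g\Bigl(\sum_{k=1}^{K} a_k \mathbf{h}_k\Bigr),
\qquad
f_{\mathrm{add}} = \sum_{k=1}^{K} g\bigl(a_k \mathbf{h}_k\bigr).
\]
In the additive form, each term $g(a_k \mathbf{h}_k)$ can be read directly as the contribution of instance $k$ to the bag-level logits, at the cost of not modeling interactions between instances inside $g$.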

\myparagraph{Causal Inference in MIL}
Recently, causality \cite{bookofwhy, pearl} has been introduced into the MIL framework to account for confounding factors that may compromise model training. Models such as CAMIL \cite{chen2024camil} and CATTMIL \cite{catt} formulate the bag-level representation as a mediator between patch embeddings and the final prediction by applying a front-door adjustment \cite{pearl}. In contrast, IBMIL \cite{lin2023interventionalbagmultiinstancelearning} adopts a back-door adjustment \cite{pearl} strategy via a two-stage training procedure to explicitly control for confounders. These approaches highlight the growing interest in causal reasoning within MIL pipelines, particularly for medical imaging applications, where biased signals affecting images might degrade prediction reliability. In our work, instead of reasoning in the image space, we operate directly at the attention level. Rather than stopping at the adjustment level of the causal ladder, we apply counterfactual interventions to the attention weights. The objective is not bias removal but improving the faithfulness of the model's inherent attention, so that it directly explains the downstream prediction.
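For reference, the two adjustments mentioned above take the following textbook form \cite{pearl}, written in generic notation that is assumed here and is not specific to any of the cited models. With input $X$, outcome $Y$, an observed confounder $C$ satisfying the back-door criterion, and a mediator $M$ satisfying the front-door criterion,
\[
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) = \sum_{c} P(Y \mid x, c)\, P(c)
\]
is the back-door adjustment, and
\[
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
\]
is the front-door adjustment. How each cited model approximates these sums in practice is specific to that work and is not reproduced here.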


\myparagraph{Counterfactual Intervention for Attention Learning}
Counterfactual analysis provides a principled framework to measure the causal influence of input features on model predictions. In computer vision, counterfactual attention learning has been introduced to guide attention mechanisms through causal supervision rather than relying solely on conventional likelihood maximization. Notably, Rao et al.~\cite{rao2021counterfactualattentionlearningfinegrained} propose a counterfactual attention learning framework that explicitly maximizes the prediction difference between factual and counterfactual attention to encourage the discovery of causally effective visual regions. While these ideas have demonstrated strong performance in fine-grained recognition and re-identification tasks, their integration into multiple instance learning for digital pathology remains largely unexplored. Our work bridges this gap by formulating counterfactual causal supervision directly at the MIL aggregation level.
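For reference, the quantity supervised in counterfactual attention learning \cite{rao2021counterfactualattentionlearningfinegrained} can be sketched as follows, in notation assumed here for illustration. Let $X$ denote the input, $A$ the learned (factual) attention, $\bar{A}$ a counterfactual attention drawn from a distribution $\gamma$ (for example, random or uniform attention), and $Y(\cdot)$ the model prediction. The effect of attention on the prediction is then
\[
Y_{\mathrm{effect}} = \mathrm{E}_{\bar{A} \sim \gamma}\Bigl[\, Y(A, X) - Y\bigl(\mathrm{do}(A = \bar{A}), X\bigr) \Bigr].
\]
Adding a classification loss on $Y_{\mathrm{effect}}$ to the usual loss on the factual prediction rewards the learned attention for outperforming uninformative alternatives, rather than merely fitting the labels.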


