\section{Results and Discussion}
We compare our method against standard ABMIL \cite{ilse2018attentionbaseddeepmultipleinstance} and its variants, which add modules to mitigate attention-related training issues: clustering (CLAM-SB \cite{clam}), multi-head attention with attention masking (ACMIL \cite{zhang2024attentionchallengingmultipleinstancelearning}), a modified attention stage (AddMIL \cite{javed2022additivemilintrinsicallyinterpretable}), and hard-instance mining in a two-stage training framework (MHIM \cite{mhim-mil}). We also include DSMIL \cite{li2021dualstreammultipleinstancelearning}, where attention is computed in a dual-stream manner but the model still satisfies Equations~\ref{eq:inst_embed_sup} and~\ref{eq:inference_eq}. In addition, we compare downstream performance against instance-based baselines that contain no attention module, namely average and maximum pooling (MeanMIL/MaxMIL), and against IBMIL \cite{lin2023interventionalbagmultiinstancelearning} on Camelyon16.
Unless stated otherwise, we use each model's raw attention to assess its reliability. We report downstream performance in terms of Area Under the Curve (AUC) and F1 score, and AUPC on correct predictions to assess attention as an explainability method. For Camelyon16, for which fine-grained patch-level annotations are available, we also report AUPRC, the area under the precision-recall curve obtained when attention, passed through a sigmoid, is used as a prediction of instance labels. Consistent with prior work \cite{early2024inherentlyinterpretabletimeseries}, we further report AOPCR, which contrasts targeted perturbation with random perturbation to assess how informative the attention ranking is compared with a random ranking of patches. Finally, we report the pointing game on the top-5 patches (PG@5), defined as the percentage of metastatic test slides for which at least one of the five most attended patches overlaps an annotated metastasis region.
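
As an illustrative formalization of the last two metrics, we use notation introduced only for this sketch: $\mathcal{S}^{+}$ is the set of metastatic test slides, $\mathcal{T}_5(s)$ the five most attended patches of slide $s$, and $\mathcal{M}(s)$ the patches of $s$ that overlap an annotated metastasis region. Under these assumptions, PG@5 can be written as
\[
\mathrm{PG@5} \;=\; \frac{100}{|\mathcal{S}^{+}|} \sum_{s \in \mathcal{S}^{+}} \mathbf{1}\left[\, \mathcal{T}_5(s) \cap \mathcal{M}(s) \neq \emptyset \,\right].
\]
Similarly, AOPCR can be read schematically as the gap between the perturbation score obtained when patches are removed in attention order, $\pi_{\mathrm{att}}$, and the average score over $R$ random removal orders $\pi_1, \dots, \pi_R$:
\[
\mathrm{AOPCR} \;\approx\; \mathrm{AOPC}(\pi_{\mathrm{att}}) \;-\; \frac{1}{R} \sum_{r=1}^{R} \mathrm{AOPC}(\pi_r).
\]
The symbols $\pi_{\mathrm{att}}$, $\pi_r$ and $R$ are assumptions made for this illustration; the exact averaging and normalization are those of \cite{early2024inherentlyinterpretabletimeseries} and are not restated here. Under this reading, a value near zero would indicate that the attention ranking is no more informative than a random ranking.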



\myparagraph{TCGA Benchmark}
We evaluate all models on BRCA and NSCLC tumor subtyping and on the more challenging LUAD TP53 mutation prediction task (Table~\ref{tab:tcga_scores_rebuttal}). On BRCA and NSCLC, most attention-based MIL models, particularly ABMIL-derived methods, achieve very high AUC, often exceeding 0.95 with in-domain foundation-model features such as UNI. However, this strong predictive performance is accompanied by large variability in attention faithfulness. ABMIL-based models frequently exhibit elevated and highly variable AUPC across slides, as reflected in the large standard deviations, whereas DSMIL shows a more favorable performance--explainability trade-off, although it is sensitive to the feature encoder. \ours applied to the baseline ABMIL achieves one of the lowest AUPC values while maintaining competitive AUC across both feature extractors, reflecting an explicit and stable performance--explainability balance. On LUAD, where performance drops for all methods, \ours still achieves low AUPC with competitive AUC, indicating that counterfactual supervision remains effective even in low-signal settings.
An important insight is that when attention-based MIL models already achieve strong downstream performance, introducing the causal intervention primarily serves to guide attention toward more causally meaningful patterns. Conversely, in settings where predictive performance is weaker, the intervention can also stabilize and improve training itself, leading to more reliable attention and a better overall performance--explainability trade-off.


\input{tabs/res_tcga_310}

\begin{figure}[htp]
{\includegraphics[width=1\linewidth]{figs/perturbation_and_aupc_BRCA_NSCLC_310.png}}
   \caption{Attention-based perturbation analysis on the BRCA and NSCLC subtyping tasks. The curves show the evolution of the target class probability as increasingly large fractions of the most attended patches are removed. Boxplots report statistics over slides.}
   \label{fig:aupc_brca}
  \vspace{-1.0em}
 \end{figure}

To further analyze the sensitivity of each model to attention-guided perturbations, Fig.~\ref{fig:aupc_brca} presents the perturbation curves on BRCA and NSCLC. Models with causally faithful attention exhibit a more pronounced degradation of prediction confidence when highly attended patches are removed, whereas models with diffuse attention remain comparatively insensitive. \ours consistently shows steeper decay profiles than its baseline ABMIL counterpart, confirming that counterfactual supervision reshapes attention toward the tissue regions that causally drive the model's prediction. We note, however, that limited prediction degradation under perturbation does not necessarily imply poor representations: attention mechanisms may still support strong feature learning and accurate bag-level predictions even when they are only weakly causally coupled to the final decision. In such cases, attention should be interpreted as a latent aggregation mechanism rather than as a faithful explanation of the model's prediction.



\myparagraph{Camelyon16 Benchmark} We evaluate models on Camelyon16 with UNI features to explicitly disentangle bag-level performance from instance-level interpretability. Table~\ref{tab:C16-UNI} reports bag-level classification performance (AUC, F1) together with instance-level attention faithfulness (AUPC, AOPCR) and localization accuracy (AUPRC, PG@5). AUPRC is computed by applying a sigmoid to the raw patch attentions (Sig). For DSMIL, the raw attention is additionally normalized with respect to the embedding dimensionality to ensure comparability with ABMIL-based variants.


At the bag level, almost all attention-based MIL models reach near-saturated performance, with AUC values consistently above 98\%. However, clear differences emerge at the instance level.

In particular, introducing the CIA module achieves a favorable balance across instance-level metrics, combining lower AUPC and higher AOPCR with competitive AUPRC and localization performance. Importantly, this reduction in AUPC is accompanied by consistent agreement with annotated metastatic regions, as evidenced by AUPRC and PG@5, rather than by degenerate sparsity or loss of localization. This suggests that counterfactual intervention on attention yields attention maps that are simultaneously globally consistent, spatially selective, and causally aligned with the outcome, without compromising slide-level performance. We note, however, that when CIA is combined with methods that already include strong attention-level regularization, such as CLAM, careful tuning of the intervention strength may be required. In such cases, an overly strong causal regularization may interfere with the existing attention constraints and lead to a slight degradation in predictive performance.


\input{tabs/res_c16_310}

\paragraph{Qualitative Attention Analysis.}

Fig.~\ref{fig:heatmaps} provides a qualitative comparison of attention maps on representative crops from a metastatic Camelyon16 slide. The first row shows a metastatic region (red annotations), followed by the corresponding ABMIL, ACMIL and \ours attention overlays, while the second row shows a non-metastatic region. ABMIL exhibits more diffuse attention and assigns non-negligible attention to non-metastatic tissue. In contrast, ACMIL and \ours concentrate attention sharply within metastatic regions and suppress responses in non-metastatic areas.

\begin{figure}[ht]
  \includegraphics[width=0.95\linewidth]{figs/heat_310.png}
  \caption{\textbf{Qualitative comparison of attention maps on a Camelyon16 WSI.} Attention is min-max normalised within the slide for each compared model: ABMIL, ACMIL, and \ours.}\label{fig:heatmaps}
  \vspace{-2.0em}
\end{figure}
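
For reference, the min-max normalisation mentioned in the caption of Fig.~\ref{fig:heatmaps} can be sketched as follows, where $a_k$ denotes the raw attention of patch $k$ in a slide with $K$ patches (symbols introduced here for illustration only):
\[
\tilde{a}_k \;=\; \frac{a_k - \min_{1 \le j \le K} a_j}{\max_{1 \le j \le K} a_j - \min_{1 \le j \le K} a_j}.
\]
Under this rescaling, the least and most attended patches of each slide map to 0 and 1 respectively, so the heatmaps show the relative ranking of patches within a slide and model rather than absolute attention magnitudes.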

\myparagraph{Why Still Use ResNet in the Era of Foundation Models?}
Although foundation models such as UNI \cite{chen2024uni} provide in-domain representations, we also include ResNet50 \cite{he2016deep} pretrained on ImageNet \cite{ImageNet} as a feature extractor. Evaluating both out-of-domain and in-domain features allows us to disentangle the effect of representation learning from that of attention supervision. Moreover, the consistent faithfulness improvements of \ours across both ResNet50 and UNI indicate that the proposed counterfactual intervention is not tied to a specific representation space. Notably, while UNI generally yields higher absolute AUC values, attention faithfulness as measured by AUPC does not automatically improve with stronger foundation features. In some cases, a MIL model achieves the same or even slightly worse performance with foundation-model features than with generic representations, as illustrated by DSMIL on LUAD TP53 mutation prediction (Table~\ref{tab:tcga_scores_rebuttal}). This further supports our central claim: improvements in representation power alone do not guarantee causally meaningful attention, and explicit counterfactual supervision of attention can help mitigate this issue.
