\section{Ablation Studies}


\myparagraph{Random vs Uniform Counterfactual Intervention}
We study the impact of the counterfactual attention distribution used for intervention. Table~\ref{tab:ablation_rebuttal} compares random and uniform counterfactual attention on BRCA and NSCLC. Both strategies consistently improve attention faithfulness over ABMIL, indicating that the proposed causal mechanism is not sensitive to the specific choice of counterfactual distribution. Subtle differences nonetheless emerge between the two variants. Random counterfactual attention generally yields slightly lower AUPC, especially on NSCLC with UNI features, indicating stronger selective pressure on attention. Uniform counterfactual attention, by contrast, occasionally preserves marginally higher AUC at the cost of slightly worse AUPC. This suggests that random intervention introduces stronger stochastic perturbations that better suppress spurious correlations, whereas uniform intervention acts as a weaker regularizer.

\input{tabs/ablation_310}


\begin{figure}[t]
  \centering
  % \vspace{-1.0em}
  \includegraphics[width=0.8\linewidth]{figs/causal_mil_auc_vs_aupc_overlay_310.png}
  \caption{Performance-interpretability trade-off controlled by the causal effect weighting coefficient $\lambda$.}
  \label{fig:auc_aupc}
  % \vspace{-1.0em}
\end{figure}

\myparagraph{Causal Effect Weighting: Performance-Interpretability Trade-off} Fig.~\ref{fig:auc_aupc} shows the influence of the causal effect weighting coefficient $\lambda$ on the balance between predictive performance and attention interpretability. When $\lambda=0$, the model reduces to a standard attention-based MIL baseline without causal supervision, resulting in weaker perturbation sensitivity. As $\lambda$ increases, AUPC drops further on both BRCA and NSCLC, indicating progressively more selective and causally aligned attention. Importantly, AUC remains within a narrow range, showing that the improvement in interpretability comes at a moderate and controlled cost in predictive performance.


\begin{table}[t]
\centering
\caption{Attention stability across folds, measured by slide-wise Spearman correlation (\%, mean $\pm$ std). We report (i) cross-architecture sensitivity, comparing ABMIL attention to that of other MIL models, and (ii) within-architecture stability between baseline and CIA-augmented attention.}
\label{tab:attn_stability_all}

\resizebox{0.8\textwidth}{!}{
    \begin{tabular}{lccc}
    \hline
     & All Slides & Positive Slides & Positive Patches \\
    \hline
    ABMIL vs ACMIL      & 71.4 $\pm$ \phantom{0}6.9 & 75.0 $\pm$ \phantom{0}4.9 & 76.4 $\pm$ \phantom{0}5.6 \\
    ABMIL vs AddMIL     & 84.3 $\pm$ 12.1 & 87.2 $\pm$ \phantom{0}9.3 & 86.8 $\pm$ \phantom{0}5.2 \\
    ABMIL vs CLAM       & 67.7 $\pm$ 15.5 & 72.8 $\pm$ 12.2 & 69.7 $\pm$ \phantom{0}2.8 \\
    ABMIL vs DSMIL      & 43.1 $\pm$ \phantom{0}8.3 & 58.2 $\pm$ 13.0 & 70.2 $\pm$ \phantom{0}4.4 \\
    ABMIL vs IBMIL      & 91.1 $\pm$ \phantom{0}2.9 & 92.6 $\pm$ \phantom{0}2.4 & 91.2 $\pm$ \phantom{0}2.1 \\
    ABMIL vs MHIM       & 57.1 $\pm$ 13.3 & 63.2 $\pm$ 12.4 & 65.1 $\pm$ \phantom{0}5.7 \\
    \hline
    ABMIL vs CIA-MIL    & 34.6 $\pm$ 24.6 & 48.7 $\pm$ 19.3 & 79.8 $\pm$ \phantom{0}1.2 \\
    ACMIL vs CIA-ACMIL  & 76.8 $\pm$ \phantom{0}9.0 & 79.3 $\pm$ \phantom{0}6.6 & 84.9 $\pm$ \phantom{0}4.3 \\
    CLAM vs CIA-CLAM    & 78.7 $\pm$ 10.4 & 82.0 $\pm$ \phantom{0}8.2 & 79.5 $\pm$ \phantom{0}6.0 \\
    \hline
    \end{tabular}
    }
\end{table}


\myparagraph{Attention Stability} To assess the sensitivity of attention to architectural changes and to counterfactual intervention, we compute the slide-wise Spearman correlation between the attention scores produced by different models. For each slide, the correlation is computed between the attention vectors of a pair of models. We report results over (i) all slides, (ii) slides containing tumor (positive slides), and (iii) pathology-relevant regions only (positive patches) in positive slides, as summarized in Table~\ref{tab:attn_stability_all}. Across both cross-architecture comparisons and baseline--CIA pairs, correlations are consistently higher when restricted to positive patches than when computed over all patches of all slides. For example, the correlation between ABMIL and CIA-MIL rises from 34.6\% over all slides to 79.8\% on positive patches. The lower correlations at the slide level therefore appear to be driven mainly by variability in the ranking of patches outside annotated tumor regions. This indicates that differences between models, and between baseline and CIA variants, predominantly affect background or non-discriminative regions, while the relative ordering of tumor-associated patches remains more stable. Applied to various MIL methods, our approach thus does not degrade the high attention assigned to important patches when the baseline already attends to them correctly. Instead, it mainly corrects the attention assigned to unimportant patches, yielding more interpretable attention maps.
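
For concreteness, the slide-wise correlation can be sketched as follows. The notation is introduced here for illustration only and is an assumption, not part of the method description. Let $\mathbf{a}_s^{A}$ and $\mathbf{a}_s^{B}$ denote the attention scores that two models $A$ and $B$ assign to the $N_s$ patches of slide $s$, and let $r_{s,i}^{A}$ and $r_{s,i}^{B}$ be the ranks of patch $i$ under each model (with average ranks for ties) and $\bar{r}_s^{A}$, $\bar{r}_s^{B}$ their means. The Spearman correlation is then the Pearson correlation of the ranks:
\begin{equation}
  \rho_s =
  \frac{\sum_{i=1}^{N_s} \bigl(r_{s,i}^{A} - \bar{r}_s^{A}\bigr)\bigl(r_{s,i}^{B} - \bar{r}_s^{B}\bigr)}
       {\sqrt{\sum_{i=1}^{N_s} \bigl(r_{s,i}^{A} - \bar{r}_s^{A}\bigr)^{2}}\,
        \sqrt{\sum_{i=1}^{N_s} \bigl(r_{s,i}^{B} - \bar{r}_s^{B}\bigr)^{2}}} .
\end{equation}
Under this sketch, the positive-patch setting would restrict the sums to the annotated tumor patches of slide $s$, with ranks recomputed within that subset (an assumption). One aggregation consistent with the caption of Table~\ref{tab:attn_stability_all} is to average $\rho_s$ over the slides of each fold and to report the mean and standard deviation of these averages across folds.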
