Keywords: Causal Learning, Whole Slide Image Classification, Vision-Language, Large Language Model
Abstract: Multiple Instance Learning (MIL) is a core method for Whole Slide Image (WSI) classification in computational pathology, but models that rely solely on visual representations often misclassify slides because the visual signal alone carries insufficient information. Large Language Models (LLMs) can supply rich textual prompts to enrich these visual representations. However, the data-driven training of LLMs often induces spurious correlations between visual signals and text, yielding inaccurate textual descriptions that pollute the alignment process and degrade WSI classification performance. To address this issue, we propose a Causal-learning Dual-attention MIL framework (CDMIL). CDMIL first performs preliminary alignment through a prototype-guided dual-attention mechanism and then applies a counterfactual learning strategy as a causal intervention: replacing factual text with counterfactual text forces the model to abandon its reliance on spurious correlations and instead learn genuine causal relationships. Experiments demonstrate that CDMIL achieves state-of-the-art performance in both accuracy and out-of-distribution robustness, validating the superiority of this causal learning framework. The code will be released at https://github.com/xxx/CDMIL.
Primary Area: causal reasoning
Submission Number: 3577