ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification
Abstract: Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs), which have giga-pixel size and hierarchical image context, in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and learn solely from the original slides, making them easily affected by variations in data distribution. Recently, vision-language model (VLM)-based methods introduced a language prior by pre-training on large-scale pathological image-text pairs. However, previous text prompts lack consideration of pathological prior knowledge and therefore do not substantially boost the model's performance. Moreover, collecting such pairs and the pre-training process are time-consuming and resource-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to effectively boost the performance of the VLM. To transfer the VLM to process WSIs efficiently, for the image branch we propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder that enhances the text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
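To make the prototype-guided patch aggregation idea concrete, below is a minimal sketch of how learnable prototype embeddings could attend over patch features from a frozen VLM image encoder so that similar patches are softly grouped into the same prototype before slide-level pooling. This is not the authors' released implementation; the class name, dimensions, and hyperparameters (e.g., `num_prototypes`) are illustrative assumptions.

```python
# Hypothetical sketch of prototype-guided patch aggregation via cross-attention.
import torch
import torch.nn as nn

class PrototypeGuidedPatchDecoder(nn.Module):
    def __init__(self, dim: int = 512, num_prototypes: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable prototypes serve as queries; each softly groups
        # a subset of visually similar patches.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) patch embeddings from a frozen image encoder.
        B = patch_feats.size(0)
        queries = self.prototypes.unsqueeze(0).expand(B, -1, -1)   # (B, P, dim)
        # Each prototype attends to (aggregates) the patches most similar to it.
        grouped, _ = self.cross_attn(queries, patch_feats, patch_feats)
        grouped = self.norm(grouped + queries)                     # (B, P, dim)
        # Pool prototype-level features into one slide-level embedding, which
        # would then be matched against the dual-scale text features.
        return grouped.mean(dim=1)                                 # (B, dim)

# Usage: slide_feat = PrototypeGuidedPatchDecoder()(torch.randn(1, 1000, 512))
```

Cross-attention with a small fixed number of prototype queries keeps the cost linear in the number of patches, which matters for giga-pixel WSIs with thousands of patches per slide.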