MGPATH: A Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot Whole Slide Pathology Classification

Published: 05 Oct 2025, Last Modified: 05 Oct 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The resulting model is used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose a multi-granular attention mechanism that models interactions between learnable prompts and both individual image patches and groups of patches. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage an (unbalanced) optimal-transport-based visual-text distance that makes the model robust to perturbations introduced by data augmentation. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach: it surpasses several recent competitors and consistently improves performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath-integrated PLIP. We release our implementations and pre-trained models at https://github.com/HauschildLab/MGPATH.
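The multi-granular attention described in the abstract can be illustrated with a minimal PyTorch sketch, not the authors' implementation: learnable prompt embeddings act as queries that attend separately to individual patch features (fine granularity) and to mean-pooled groups of patches (coarse granularity). All names here (`MultiGranularPromptAttention`, `n_prompts`, `group_size`) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of multi-granular prompt attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularPromptAttention(nn.Module):
    def __init__(self, dim: int, n_prompts: int = 4, n_heads: int = 8, group_size: int = 16):
        super().__init__()
        # Learnable prompt embeddings, shared across slides.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.fine_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.coarse_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.group_size = group_size

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) frozen patch features from the vision encoder.
        B, N, D = patch_feats.shape
        q = self.prompts.unsqueeze(0).expand(B, -1, -1)        # (B, P, D)
        # Fine granularity: prompts attend to individual patches.
        fine, _ = self.fine_attn(q, patch_feats, patch_feats)
        # Coarse granularity: pool contiguous patches into groups for broader context.
        pad = (-N) % self.group_size
        padded = F.pad(patch_feats, (0, 0, 0, pad))
        groups = padded.view(B, -1, self.group_size, D).mean(dim=2)  # (B, N/g, D)
        coarse, _ = self.coarse_attn(q, groups, groups)
        return fine + coarse  # fused multi-granular prompt features
```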
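The optimal-transport-based visual-text distance can likewise be sketched. For brevity this shows the balanced entropic variant solved with a few Sinkhorn iterations; the unbalanced version mentioned in the abstract relaxes the hard marginal constraints (e.g., with KL penalties). Function name and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an entropic OT distance between visual and text embeddings.
import torch

def sinkhorn_distance(x: torch.Tensor, y: torch.Tensor,
                      eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    # x: (N, D) visual features, y: (M, D) text embeddings, both L2-normalized.
    cost = 1.0 - x @ y.t()                                   # (N, M) cosine cost
    mu = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
    nu = torch.full((y.size(0),), 1.0 / y.size(0), device=x.device)
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                                 # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)               # transport plan
    return (plan * cost).sum()                               # OT distance
```

Compared with a plain cosine similarity between pooled features, the transport plan matches each patch to its most compatible text embedding, which is what makes the distance less sensitive to augmentation-induced perturbations of individual patches.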
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1) Added ablation studies in the appendix on the effect of data augmentation on the performance of MGPATH (PLIP-G) for two configurations, cosine similarity and optimal transport. 2) Added ablation studies in the appendix investigating the running time of MGPATH (PLIP-G) under the same two configurations: cosine similarity and optimal transport. 3) Included figures in the Ablation Study section with results for additional few-shot experiments with K = 2, 4, 6, 8, 16. 4) Corrected the abbreviations of MGPATH throughout the manuscript and added an explanation of the abbreviations MGPATH (PLIP-G), MGPATH (PLIP), and MGPATH (ViT). 5) Provided ablation experiments investigating the performance of MGPATH (PLIP-G) with text prompts generated by various large language models, such as GPT-4o and Grok-3. 6) Publicly released the code repository.
Assigned Action Editor: ~Frederic_Sala1
Submission Number: 4842