Few-Shot Whole Slide Pathology Classification with Multi-Granular Vision-Language Models

Published: 06 Mar 2025, Last Modified: 01 Apr 2025
Venue: ICLR 2025 FM-Wild Workshop
License: CC BY 4.0
Keywords: few-shot pathology, vision-language models
TL;DR: a multi-granular prompt learning method to advance few-shot pathology classification
Abstract: In this study, we propose a novel architecture for a large vision-language model, adapted with a multi-granular prompt learning method, to advance few-shot pathology classification. Starting with the Prov-GigaPath foundation model (pre-trained on 1.3 billion pathology image patches), we extend it into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. In contrast to previous approaches that combine prompts with frozen features using prefix embeddings or self-attention, our multi-granular attention mechanism evaluates interactions between learnable prompts, individual image patches, and patch groups, capturing both fine details and broader context. We further improve precision with an unbalanced optimal transport-based visual-text distance that mitigates perturbations from data augmentation. Experiments on lung and kidney pathology imaging modalities show that our method outperforms state-of-the-art competitors and improves performance across various architectures, including CLIP, PLIP, and the Prov-GigaPath-integrated PLIP.
Submission Number: 89
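
The abstract mentions an unbalanced optimal transport-based visual-text distance but does not give its formulation. As a rough illustration only, the sketch below shows a generic entropic unbalanced Sinkhorn iteration with KL-relaxed marginals, applied to a cosine cost between visual and text feature vectors. The function names (`cosine_cost`, `unbalanced_sinkhorn`), the parameters `eps` and `rho`, and the choice of cosine cost are all assumptions made for this example; this is not the authors' implementation.

```python
import math

def cosine_cost(X, Y):
    """Cost matrix C[i][j] = 1 - cos(X[i], Y[j]); assumes non-zero vectors."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    Xn = [unit(x) for x in X]
    Yn = [unit(y) for y in Y]
    return [[1.0 - sum(p * q for p, q in zip(x, y)) for y in Yn] for x in Xn]

def unbalanced_sinkhorn(C, a, b, eps=0.1, rho=1.0, n_iters=200):
    """Generic entropic unbalanced OT (hypothetical helper, not from the paper).

    a, b : marginal weights for rows (e.g. patches) and columns (e.g. prompts)
    eps  : entropic regularization strength (assumed value)
    rho  : KL penalty on marginal violation; smaller rho lets more mass be dropped
    Returns the transport plan P and the transport cost sum(P * C).
    """
    n, m = len(C), len(C[0])
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    power = rho / (rho + eps)  # exponent < 1 softens the marginal constraints
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / Kv[i]) ** power for i in range(n)]
        Ktu = [sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / Ktu[j]) ** power for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(P[i][j] * C[i][j] for i in range(n) for j in range(m))
    return P, cost

# Toy usage: two "visual" vectors against matching and mismatching "text" vectors.
visual = [[1.0, 0.0], [0.0, 1.0]]
text_match = [[1.0, 0.0], [0.0, 1.0]]
text_mismatch = [[-1.0, 0.0], [0.0, -1.0]]
weights = [0.5, 0.5]

_, cost_match = unbalanced_sinkhorn(cosine_cost(visual, text_match), weights, weights)
_, cost_mismatch = unbalanced_sinkhorn(cosine_cost(visual, text_mismatch), weights, weights)
```

In this toy setting the matching pair yields a much smaller transport cost than the mismatching pair. Because the marginals are only softly enforced, the plan may transport less than the full mass when every assignment is expensive, which is the property that makes unbalanced formulations tolerant of outlier or perturbed features.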