PathoPrompt: Cross-Granular Semantic Alignment for Medical Pathology Vision-Language Models

Published: 01 Jan 2025, Last Modified: 04 Nov 2025 · MICCAI (7) 2025 · CC BY-SA 4.0
Abstract: Pre-trained vision-language (V-L) models have demonstrated impressive generalization on a variety of downstream tasks, yet their performance depends heavily on the input text prompts. Previous studies (e.g., CoPrompt) have used detailed descriptions generated by LLMs to assist model learning. For example, while a coarse-grained prompt like “A photo of Debris.” carries little information, a fine-grained description such as “Debris consists of dead cells and matrix fragments.” provides additional context and improves model performance. However, existing methods generally lack the sensitivity to capture the subtle semantic differences that are crucial for accurately classifying pathology images. To tackle this challenge, we introduce PathoPrompt, a framework that leverages Cross-Granular Semantic Alignment to refine the model’s ability to capture subtle semantic variations in pathology image classification. Specifically, we introduce token-level fine-grained alignment, allowing the model to capture the subtle differences that are crucial for accurate classification. Further, Cross-Granular Semantic Distillation improves generalization by filtering out irrelevant information from both coarse- and fine-grained prompts. Moreover, PathoPrompt employs a prototype-based cross-modal separation mechanism that promotes distinct class boundaries by separating image and text semantics, yielding more effective multi-modal representation learning. Experiments on five pathology datasets and three task types demonstrate that our method outperforms previous approaches.
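
The abstract does not spell out the exact formulation of the token-level alignment, so the following is only a minimal sketch of one common way such an objective is realized: a late-interaction score between image patch tokens and prompt tokens, trained with a symmetric InfoNCE loss. The function names, tensor shapes, and temperature are illustrative assumptions, not the paper’s method.

```python
import torch
import torch.nn.functional as F

def token_alignment_scores(img_tok, txt_tok):
    """Late-interaction score for every image/text pair in a batch.

    img_tok: (B, P, D) patch embeddings from the vision encoder.
    txt_tok: (B, T, D) token embeddings of the fine-grained prompt.
    Each text token is matched to its most similar image patch; the
    per-token matches are then averaged into a single pair score.
    """
    img_tok = F.normalize(img_tok, dim=-1)
    txt_tok = F.normalize(txt_tok, dim=-1)
    sim = torch.einsum("ipd,jtd->ijpt", img_tok, txt_tok)  # (B, B, P, T)
    return sim.max(dim=2).values.mean(dim=-1)              # (B, B)

def fine_grained_alignment_loss(img_tok, txt_tok, temperature=0.07):
    """Symmetric InfoNCE over the token-level pair scores,
    contrasting matched pairs against in-batch negatives."""
    logits = token_alignment_scores(img_tok, txt_tok) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 4 images with 49 patches, 4 fine-grained prompts of 16 tokens, dim 512.
img_tok = torch.randn(4, 49, 512)
txt_tok = torch.randn(4, 16, 512)
print(fine_grained_alignment_loss(img_tok, txt_tok))
```

The max over patches lets each prompt token (e.g., “dead cells”) align with the single patch that best matches it, which is what makes the score sensitive to token-level detail, in contrast to a single pooled image-sentence similarity.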