Contrastive Pretraining for Computational Pathology with Visual-Language Models

Published: 12 May 2025, Last Modified: 25 Mar 2026 · ISBI · CC BY 4.0
Abstract: In computational pathology, effectively capturing visual-language embeddings from extensive pathology image-text pairs has become increasingly crucial for diverse downstream tasks. Although prior studies have fine-tuned models like CLIP using large pathology image-text datasets, these models are constrained by their separate processing of text and images, which restricts their ability to capture the cross-modal relationships critical in pathology. Recent advancements in large language models (LLMs) have led to vision-language models (VLMs) with enhanced multimodal capabilities, including stronger language comprehension and reasoning than CLIP. However, while VLMs show potential for multimodal embedding, previous efforts have primarily focused on text-based tasks, leaving their application to multimodal pathology data largely unexplored. In this work, we introduce a VLM-based framework designed to integrate and align pathology visual-language embeddings within a single model. We validate our framework's effectiveness through cross-modal retrieval on pathology image-caption datasets and zero-shot patch classification across seven pathology image datasets, demonstrating that it outperforms CLIP-based models and underscoring its potential for advancing pathology research.
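The abstract does not specify the training objective, but the contrastive alignment it references is conventionally the symmetric InfoNCE loss popularized by CLIP, which pulls paired image and caption embeddings together while pushing apart non-matching pairs in the batch. The sketch below is a minimal, hypothetical PyTorch illustration of that standard objective, not the authors' implementation; all names and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the vision and language
    branches; matching rows are the positive pairs.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption: targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random tensors stand in for pathology patch and caption embeddings.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(contrastive_loss(imgs, txts).item())
```

The same shared embedding space supports both reported evaluations: cross-modal retrieval ranks captions (or images) by cosine similarity to a query, and zero-shot patch classification scores each patch against text embeddings of class-name prompts and picks the highest-similarity class.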