Multi-Modal Foundation Models for Computational Pathology: A Survey

Published: 06 Oct 2025, Last Modified: 06 Oct 2025
Venue: NeurIPS 2025 2nd Workshop FM4LS Poster
License: CC BY 4.0
Keywords: Computational Pathology, Foundation Model, Multi-Modality
Abstract: Foundation models have become a key paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. Early work centered on uni-modal models trained solely on visual data, but recent advances highlight the potential of multi-modal approaches that integrate textual reports, structured knowledge, and molecular profiles. In this survey, we review 32 multi-modal foundation models built primarily on hematoxylin and eosin (H&E) whole-slide images (WSIs) and tile-level representations, categorizing them into vision–language, vision–knowledge graph, and vision–gene expression paradigms, with vision–language models further divided into non-LLM- and LLM-based variants. We also analyze 28 datasets, grouped into image–text pairs, instruction datasets, and image–other modality pairs, and summarize downstream tasks, training and evaluation strategies, and future challenges. This survey provides a comprehensive resource for advancing AI in pathology.
Submission Number: 12