Multi-modal Denoising Diffusion Pretraining for Whole-Slide Image Classification

Published: 20 Jul 2024 · Last Modified: 06 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: Whole-slide image (WSI) classification methods play a crucial role in tumor diagnosis. Most of them use hematoxylin and eosin (H&E) stained images, whereas immunohistochemistry (IHC) staining provides molecular markers and protein expression information that highlights cancer regions. However, obtaining IHC-stained images incurs higher costs in practice. In this work, we propose a multi-modal denoising diffusion pre-training framework that harnesses the advantages of IHC staining to learn visual representations. The framework is trained on an H&E-to-IHC re-staining task and an IHC-stained image reconstruction task, which together help capture the structural similarity and the staining differences between the two image modalities. The trained model can then provide IHC-guided features while taking only H&E-stained images as inputs. In addition, we build a new class-constrained contrastive loss to enforce semantic consistency between the dual-modal features from our pre-training framework. To integrate with WSI classifiers based on multiple-instance learning (MIL), we further propose a bag feature augmentation strategy that extends bags with features extracted by our pre-trained model. Experimental results on three datasets show that our pre-training framework effectively improves WSI classification and surpasses state-of-the-art pre-training approaches. Code and models are released at https://github.com/lhaof/MDDP
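For intuition, the sketch below outlines what such a joint pre-training objective could look like in PyTorch: a shared epsilon-prediction DDPM loss applied to the H&E-to-IHC re-staining task (H&E image as condition) and to the IHC reconstruction task (no condition), plus a class-constrained contrastive term over the dual-modal features. This is a minimal sketch under stated assumptions, not the paper's actual implementation; `eps_model`, its `encode` hook, `pretraining_loss`, and the weight `lambda_con` are all hypothetical names.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def ddpm_loss(eps_model, x0, cond, t, noise=None):
    # Standard DDPM epsilon-prediction loss for one denoising task.
    # x0: clean IHC images; cond: conditioning input (H&E images for
    # re-staining, or None for uni-modal IHC reconstruction).
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # sample q(x_t | x_0)
    eps_pred = eps_model(x_t, t, cond)                      # predict the added noise
    return F.mse_loss(eps_pred, noise)

def class_constrained_contrastive(feat_he, feat_ihc, labels, tau=0.1):
    # SupCon-style sketch of a class-constrained contrastive loss:
    # cross-modal pairs sharing a class label are positives; all other
    # pairs in the batch act as negatives.
    z1 = F.normalize(feat_he, dim=1)
    z2 = F.normalize(feat_ihc, dim=1)
    logits = z1 @ z2.t() / tau                              # (B, B) similarities
    pos = labels.view(-1, 1).eq(labels.view(1, -1)).float()
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def pretraining_loss(eps_model, he, ihc, labels, lambda_con=0.1):
    # Joint objective: re-staining + reconstruction + contrastive term.
    t = torch.randint(0, T, (ihc.size(0),), device=ihc.device)
    l_restain = ddpm_loss(eps_model, ihc, cond=he, t=t)     # H&E -> IHC re-staining
    l_recon = ddpm_loss(eps_model, ihc, cond=None, t=t)     # IHC reconstruction
    # eps_model.encode is a hypothetical hook returning pooled encoder features.
    l_con = class_constrained_contrastive(eps_model.encode(he),
                                          eps_model.encode(ihc), labels)
    return l_restain + l_recon + lambda_con * l_con
```

After pre-training, only the encoder path conditioned on H&E inputs would be kept, so IHC-guided features can be extracted without any IHC images at inference time.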
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Generation] Generative Multimedia, [Content] Media Interpretation, [Content] Multimodal Fusion
Relevance To Conference: We propose a novel multi-modal denoising-diffusion-based pre-training framework for histopathology image classification. The framework is pre-trained with a cross-modal image generation task (translating H&E-stained images to IHC-stained images) and a uni-modal image reconstruction task to learn representations for different downstream tasks of histopathology image understanding. Thus, our method can be viewed as a 'multi-modal foundation model' for histopathology images. Moreover, since our multi-modal pre-training method is built on a generative model, the denoising diffusion model, our work is relevant to 'Generative Multimedia'. Furthermore, we fuse multiple image modalities to train the proposed framework and use it to provide cross-modal features for image understanding tasks with a missing modality. Therefore, our work is also related to 'Multimodal Fusion' and 'Media Interpretation'.
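As a rough illustration of how such cross-modal features would reach a downstream MIL classifier under the bag feature augmentation strategy, the hypothetical helper below extends a bag with IHC-guided features obtained from H&E patches alone. `mddp_encoder` and `augment_bag` are assumed names, and concatenating along the instance axis is one plausible reading of "extending bags", not the confirmed design.

```python
import torch

@torch.no_grad()
def augment_bag(bag_feats, he_patches, mddp_encoder):
    # bag_feats: (N, D) instance features from the baseline patch extractor.
    # he_patches: the same N H&E patches; mddp_encoder: hypothetical frozen
    # encoder from the pre-trained framework, returning (N, D) IHC-guided
    # features from H&E inputs alone.
    guided = mddp_encoder(he_patches)
    # Extend the bag along the instance axis; a MIL aggregator
    # (e.g., attention pooling) then sees 2N instances per slide.
    return torch.cat([bag_feats, guided], dim=0)
```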
Supplementary Material: zip
Submission Number: 746