Abstract: Contrastive vision-language pre-training has shown great promise for representation transfer and cross-modal learning in the medical domain. However, without fully exploiting the intrinsic properties and correlations of multimodal medical data within patient studies, current research leaves much of the available information unused, leading to suboptimal representation learning. In this paper, we propose a novel pre-training framework for learning better medical vision-language embeddings, organized around patients' study-level data. Exploiting the order-agnostic nature of radiology reports, we adopt a two-stage feature extraction method for more representative textual characterization. Then, by leveraging momentum encoders and memory queues, study-level semantics are captured with three contrastive objectives that provide comprehensive supervision from three perspectives, i.e., cross-modal, multi-modal, and uni-modal, so that information neglected by previous research can be fully exploited. The superiority of the proposed framework is demonstrated by consistent improvements on four typical downstream tasks: zero-shot/data-efficient image classification, image segmentation, and cross-modal retrieval.
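To make the momentum-encoder/memory-queue mechanism mentioned above concrete, below is a minimal, illustrative sketch (not the authors' implementation) of one of the three objectives, a MoCo-style cross-modal contrastive loss between image queries and momentum-encoded report keys. All names (img_encoder, txt_encoder, txt_encoder_m) and hyperparameters (dim, queue_size, momentum, temperature) are hypothetical placeholders, and the choice of putting the momentum encoder on the text side is an assumption for illustration only.

import torch
import torch.nn.functional as F


class CrossModalQueueContrast(torch.nn.Module):
    """Illustrative MoCo-style cross-modal objective with a memory queue."""

    def __init__(self, img_encoder, txt_encoder, txt_encoder_m,
                 dim=256, queue_size=8192, momentum=0.999, temperature=0.07):
        super().__init__()
        self.img_encoder = img_encoder        # online image encoder
        self.txt_encoder = txt_encoder        # online text encoder
        self.txt_encoder_m = txt_encoder_m    # momentum copy of the text encoder
        self.m = momentum
        self.t = temperature
        # Memory queue of past momentum text embeddings, used as extra negatives.
        self.register_buffer("queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential moving-average update of the momentum text encoder.
        for p, p_m in zip(self.txt_encoder.parameters(), self.txt_encoder_m.parameters()):
            p_m.data = p_m.data * self.m + p.data * (1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        # Assumes queue_size is divisible by the batch size (standard MoCo caveat).
        ptr = int(self.queue_ptr)
        bsz = keys.shape[0]
        self.queue[:, ptr:ptr + bsz] = keys.T
        self.queue_ptr[0] = (ptr + bsz) % self.queue.shape[1]

    def forward(self, images, reports):
        q = F.normalize(self.img_encoder(images), dim=-1)          # image queries
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.txt_encoder_m(reports), dim=-1)   # momentum text keys
        # Positives: matched image-report pairs; negatives: reports in the queue.
        l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)
        l_neg = torch.einsum("nc,ck->nk", q, self.queue.clone().detach())
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)
        self._enqueue(k)
        return loss

The multi-modal and uni-modal objectives described in the abstract would follow the same pattern with different query/key pairings; they are omitted here for brevity.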
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In this work, we propose a new vision-language pre-training framework that exploits study-level information in the medical domain. Our framework benefits both uni-modal and multi-modal tasks.
Supplementary Material: zip
Submission Number: 4519