Keywords: multimodal learning, self-supervised learning, foundation models, machine learning for health
Abstract: State-of-the-Art (SOTA) medical image classification models are generally pre-trained on large-scale data via self-supervised learning frameworks to obtain high-quality image representations for radiology-based downstream tasks. These models are typically trained on a large number of images or image-text pairs, where the text is extracted from the radiology reports associated with the images.
Such approaches neglect the rich contextual information available in the patient's Electronic Health Records (EHR), such as vital sign measurements and laboratory test results, which may be highly relevant for some modalities and tasks, such as those involving Chest X-Rays (CXR). Leveraging additional modalities during pre- and/or post-training is not straightforward due to the small scale of paired multimodal datasets. In this paper, we propose a new modular alignment strategy that leverages EHR data to enhance the quality of the representations of a pre-trained CXR image classification model, without requiring training from scratch. In particular, the framework employs a cross-modal learning objective that captures both global and localized interactions between CXR and EHR features. We ran experiments using the largest publicly available multimodal dataset, combining MIMIC-CXR and MIMIC-IV, to build a new chest X-ray image classification model denoted MedCAM. We evaluated MedCAM on several publicly available datasets. Our empirical findings show that it significantly outperforms a variety of SOTA baselines in terms of area under the receiver operating characteristic curve. The results highlight the benefit of leveraging EHR data and illustrate the potential of modular learning for efficient multimodal model enhancements.
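To make the notion of "global and localized interactions" concrete, the following is a minimal, hypothetical PyTorch sketch of a combined global-plus-local cross-modal alignment objective. It is not the authors' MedCAM implementation: the function names, the symmetric InfoNCE global term, the attention-based local term, and the loss weighting are all illustrative assumptions.

```python
# Hypothetical sketch of a global + local cross-modal alignment objective.
# All names and design choices below are illustrative assumptions, not MedCAM's API.
import torch
import torch.nn.functional as F


def global_alignment_loss(img_emb, ehr_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled CXR and EHR embeddings (global alignment)."""
    img = F.normalize(img_emb, dim=-1)            # (B, D) pooled image features
    ehr = F.normalize(ehr_emb, dim=-1)            # (B, D) pooled EHR features
    logits = img @ ehr.t() / temperature          # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def local_alignment_loss(img_tokens, ehr_tokens, temperature=0.07):
    """Localized alignment: each EHR feature token attends over image patch tokens,
    and the attended image context is pulled toward that token."""
    img_t = F.normalize(img_tokens, dim=-1)       # (B, P, D) patch-level image features
    ehr_t = F.normalize(ehr_tokens, dim=-1)       # (B, T, D) EHR feature tokens
    attn = torch.softmax(ehr_t @ img_t.transpose(1, 2) / temperature, dim=-1)  # (B, T, P)
    context = attn @ img_t                        # (B, T, D) image context per EHR token
    sim = F.cosine_similarity(context, ehr_t, dim=-1)  # (B, T)
    return (1.0 - sim).mean()


def cross_modal_loss(img_emb, ehr_emb, img_tokens, ehr_tokens, lam=0.5):
    """Total objective: global term plus a weighted local term (weighting is an assumption)."""
    return (global_alignment_loss(img_emb, ehr_emb)
            + lam * local_alignment_loss(img_tokens, ehr_tokens))
```

In this sketch the pre-trained CXR encoder would stay frozen or lightly tuned, and only the alignment modules are trained on paired CXR-EHR data, which is one way to realize the modular, no-training-from-scratch setup described in the abstract.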
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24702