Keywords: Multi-Modal Learning, Self-Supervised Learning, Out-of-Distribution, Radiology
Abstract: Although humans' ability to visually understand the structure of the world plays a
crucial role in perceiving it and making appropriate decisions, human perception does not
rely solely on vision but amalgamates information from acoustic, verbal, and visual stimuli.
An active area of research revolves around designing efficient frameworks that adapt to
multiple modalities and ideally improve the performance of existing tasks. While numerous
frameworks have proved effective on natural datasets like ImageNet, only a limited number
of studies have been carried out in the biomedical domain.
In this work, we extend the frameworks available for natural data to biomedical data by
leveraging the abundant, unstructured multi-modal data available as radiology images and
reports. We attempt to answer the question, "Of multi-modal learning, self-supervised
learning, and joint learning using both strategies, which improves the visual representation
for downstream chest radiograph classification tasks the most?" Our experiments indicate
that in limited labeled data settings with 1% and 10% labeled data, joint learning with
multi-modal and self-supervised models outperforms self-supervised learning and is on par
with multi-modal learning. Additionally, we find that multi-modal learning is generally
more robust on out-of-distribution datasets.