UltraMAE: Multi-modal Masked Autoencoder for Ultrasound Pre-training

Published: 06 Jun 2024, Last Modified: 06 Jun 2024, MIDL 2024 Poster, CC BY 4.0
Keywords: Multi-Modal, Foundation Model, Classification
Abstract: Pre-training on a large dataset such as ImageNet followed by supervised fine-tuning has brought success to many deep learning tasks. However, natural images and ultrasound images differ considerably, which makes pre-training on natural images ineffective for ultrasound-related tasks. In this paper, we introduce a unified masking-based model for both ultrasound images and videos that learns better visual representations than networks trained on a single modality. This is the first large-scale, generalized ultrasound pre-training network that simultaneously utilizes 100,000+ videos and images spanning different parts of the human anatomy, such as the liver, bones, heart, thyroid, and nerves, making it an effective pre-trained benchmark model for ultrasound-specific downstream tasks. We propose a novel method for ultrasound image analysis that uses an ultrasound-specific confidence map to guide low-level representation learning through masked feature acquisition. Our pre-trained network demonstrates strong efficacy and versatility on both classification and segmentation tasks across a range of ultrasound pathologies, highlighting its potential for widespread adoption in the ultrasound field. In addition, we show that our pre-trained model can be leveraged to learn efficiently from a small number of labeled ultrasound images.
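
The abstract describes the confidence-map-guided masking only at a high level, so the following is a minimal sketch of one plausible realization in PyTorch: the confidence map is assumed to be pooled to a single score per patch, and higher-confidence patches are masked preferentially so the encoder must reconstruct the most informative regions. The function name `confidence_guided_mask`, the Gumbel-top-k sampling rule, and the 75% mask ratio are illustrative assumptions, not the paper's specification.

```python
import torch

def confidence_guided_mask(confidence: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """Sample a boolean patch mask (True = masked), biased by per-patch confidence.

    confidence: (B, N) scores in (0, 1], one value per image patch
    (assumed to come from pooling an ultrasound confidence map).
    Gumbel-top-k selects int(mask_ratio * N) patches without replacement,
    with selection probability proportional to the confidence score.
    """
    B, N = confidence.shape
    num_mask = int(mask_ratio * N)
    u = torch.rand(B, N).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))                       # Gumbel(0, 1) noise
    scores = torch.log(confidence.clamp_min(1e-6)) + gumbel  # perturbed log-weights
    idx = scores.topk(num_mask, dim=1).indices               # stochastic top-k
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

# Example: batch of 2 images, each split into a 14x14 grid of 196 patches.
conf = torch.rand(2, 196)            # stand-in for a pooled confidence map
mask = confidence_guided_mask(conf)  # 147 of 196 patches masked per image
visible = (~mask).sum(dim=1)         # 49 visible patches fed to the encoder
```

In an MAE-style pipeline, only the visible patches would be passed through the encoder, the masked positions would be reconstructed by a lightweight decoder, and the reconstruction loss would be computed on the masked patches alone.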