Keywords: vision transformers, medical image analysis
Abstract: Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. Recently, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive performance in the natural image domain while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore whether it is feasible to switch to transformer-based models in the medical imaging domain as well, or whether we should keep working with CNNs: can we trivially replace CNNs with transformers? We examine this question in a series of experiments on several standard medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers pretrained on ImageNet are on par with CNNs in both classification and segmentation tasks. Further, ViTs often outperform their CNN counterparts when pretrained using self-supervision.
One-sentence Summary: We investigate whether vision transformers can be used as drop-in replacements for CNNs in medical image analysis tasks
Supplementary Material: zip