Abstract: Medical image analysis is an active research area
because of its usefulness in clinical applications such
as early disease diagnosis and treatment. Convolutional neural
networks (CNNs) have become the de facto standard in medical
image analysis tasks because of their ability to learn complex
features from the available datasets, allowing them to surpass
humans in many image-understanding tasks. In addition to
CNNs, transformer architectures have also gained popularity
for medical image analysis tasks. However, despite progress in
the field, there is still room for improvement. This
study uses different CNNs and transformer-based methods with
a wide range of data augmentation techniques. We evaluated
their performance on three medical image datasets from different
modalities and compared the performance of the
vision transformer (ViT) model with that of other state-of-the-art (SOTA) pretrained CNN models. For the chest X-ray dataset, our vision transformer
model achieved the highest F1 score of 0.9532, recall of 0.9533,
Matthews correlation coefficient (MCC) of 0.9259, and ROC-AUC score of 0.97. Similarly, for the Kvasir dataset, we achieved
an F1 score of 0.9436, recall of 0.9437, MCC of 0.9360, and
ROC-AUC score of 0.97. For Kvasir-Capsule, a large-scale
video capsule endoscopy (VCE) dataset, our ViT model achieved a weighted F1 score
of 0.7156, recall of 0.7182, MCC of 0.3705, and ROC-AUC
score of 0.57. We found that our transformer-based models were
more effective than various CNN models at classifying
different anatomical structures, findings, and abnormalities. Our
model improved on the CNN-based approaches, suggesting
that it could serve as a new benchmark for future
algorithm development.
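
The multi-class metrics reported above (weighted F1 score, recall, MCC, and ROC-AUC) can be computed with scikit-learn. The snippet below is an illustrative sketch only, not the authors' evaluation code; y_true, y_prob, and y_pred are hypothetical placeholders for ground-truth labels, per-class probabilities, and predicted labels.

# Illustrative sketch (assumed, not from the paper): computing the
# reported multi-class metrics with scikit-learn.
import numpy as np
from sklearn.metrics import f1_score, recall_score, matthews_corrcoef, roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0])      # ground-truth class labels (placeholder)
y_prob = np.array([[0.8, 0.1, 0.1],        # per-class predicted probabilities (placeholder)
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.3, 0.4],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1]])
y_pred = y_prob.argmax(axis=1)             # hard class predictions

print("weighted F1     :", f1_score(y_true, y_pred, average="weighted"))
print("weighted recall :", recall_score(y_true, y_pred, average="weighted"))
print("MCC             :", matthews_corrcoef(y_true, y_pred))
print("ROC-AUC (OvR)   :", roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted"))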