Hybrid CNN-ViT Models for Medical Image Classification

Published: 01 Jan 2024, Last Modified: 04 Mar 2025 · ISBI 2024 · CC BY-SA 4.0
Abstract: Vision Transformers capture long-range global dependencies through attention layers but lack the inductive biases of convolutions, which hampers generalization on small datasets, a common constraint in medical image classification. This study focuses on classifying chest X-ray images of diseases affecting the lungs, such as COVID-19 and viral and bacterial pneumonia. To address these challenges, we explore hybrid models that incorporate some of the advantages of CNNs into Vision Transformers, enabling training on smaller datasets. Specifically, we compare hybrid models pre-trained on ImageNet-1k against the traditional Vision Transformer pre-trained on ImageNet-21k, using both a subset and the entire COVID-QU-Ex dataset, and we also explore training the models from scratch. The results demonstrate the superiority of the hybrid models in terms of accuracy, training time, and dataset-size requirements.
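The hybrid design described in the abstract, a convolutional stem supplying patch embeddings to a transformer encoder, can be illustrated with a minimal PyTorch sketch. All layer sizes, the grayscale single-channel input, and the four-class output (e.g., normal, COVID-19, viral, and bacterial pneumonia) are illustrative assumptions, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    """Minimal hybrid sketch: a small CNN stem (convolutional inductive
    bias) produces token embeddings for a standard Transformer encoder.
    All hyperparameters here are hypothetical."""
    def __init__(self, num_classes=4, embed_dim=64, depth=2, heads=4):
        super().__init__()
        # CNN stem: two strided convs, then pool to a 14x14 token grid
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(14),  # 14 * 14 = 196 "patch" tokens
        )
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 14 * 14 + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            embed_dim, heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.stem(x).flatten(2).transpose(1, 2)   # (B, 196, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)     # prepend CLS
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])       # classify from CLS

model = HybridCNNViT()
logits = model(torch.randn(2, 1, 224, 224))  # two grayscale chest X-rays
print(logits.shape)  # torch.Size([2, 4])
```

In practice the stem would be a pre-trained CNN backbone (e.g., a ResNet trained on ImageNet-1k, as the comparison in the paper suggests), which is what allows the hybrid to train effectively on smaller medical datasets.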