EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification

Tahir Hussain, Hayaru Shouno, Abid Hussain, Dostdar Hussain, Muhammad Ismail, Tatheer Hussain Mir, Fang Rong Hsu, Taukir Alam, Shabnur Anonna Akhy

Published: 2025 in IEEE Access. Last Modified: 12 Jun 2025. License: CC BY-SA 4.0

Abstract: The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in capturing the global context and lack interpretability, thereby hindering their clinical adoption. This study presents EFFResNet-ViT, a novel hybrid deep learning (DL) model designed to address these challenges by combining EfficientNet-B0 and ResNet-50 CNN backbones with a vision transformer (ViT) module. The proposed architecture employs a feature fusion strategy to integrate the local feature extraction strengths of CNNs with the global dependency modeling capabilities of transformers. The extracted features are further refined through a post-transformer CNN and a global average pooling layer to enhance the classification performance. To improve interpretability, EFFResNet-ViT incorporates Grad-CAM visualization techniques to highlight regions contributing to classification decisions and employs t-distributed stochastic neighbor embedding for feature space analysis, providing insights into class separability. The proposed model was evaluated on two benchmark datasets: brain tumor (BT) CE-MRI for BT classification and a retinal image dataset for ophthalmological diagnosis. EFFResNet-ViT achieved state-of-the-art performance, with accuracies of 99.31% and 92.54% on the BT CE-MRI and retinal datasets, respectively. Comparative analyses demonstrate the superior classification performance and interpretability of EFFResNet-ViT over existing ViT and CNN-based hybrid models. The explainable design of EFFResNet-ViT addresses the critical need for transparency in artificial intelligence-driven medical diagnostics, facilitating its potential integration into clinical workflows to improve decision-making and patient outcomes.
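The abstract names Grad-CAM as the technique used to highlight image regions that drive a classification decision, but gives no implementation detail. As a rough illustration only, the standard Grad-CAM computation can be sketched in plain Python: each channel of a convolutional feature map is weighted by the spatial average of the class-score gradient for that channel, the weighted maps are summed, and a ReLU keeps only positively contributing regions. The function name `grad_cam`, the final normalization to [0, 1], and the toy values below are assumptions made for this example, not the authors' code. In a real model the activations and gradients would be captured from a chosen convolutional layer, and which layer EFFResNet-ViT uses is not stated here.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one image and one target class.

    activations: feature maps of a conv layer, nested lists shaped [K][H][W].
    gradients:   gradient of the class score w.r.t. those maps, same shape.
    Returns an [H][W] heatmap scaled to [0, 1].
    """
    num_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pool of the gradients over space.
    weights = [
        sum(sum(row) for row in grad_k) / (height * width)
        for grad_k in gradients
    ]

    # Weighted sum of the activation maps, then ReLU to keep only
    # locations that push the class score up.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(num_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be overlaid on the input image
    # (an illustrative choice; skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example (hypothetical values): two 2x2 feature maps.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],   # channel 0
    [[0.0, 3.0], [1.0, 0.0]],   # channel 1
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],       # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 opposes it
]
heatmap = grad_cam(activations, gradients)
print(heatmap)  # [[0.5, 0.0], [0.0, 1.0]]
```

In the toy example, channel 0 receives weight 1 and channel 1 receives weight -1, so locations where only channel 1 fires are zeroed by the ReLU, and the bottom-right cell, where channel 0 is strongest, becomes the peak of the heatmap.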