Abstract: Traditional Convolutional Neural Networks (CNNs) often struggle to capture intricate spatial relationships and nuanced patterns in diverse datasets. To overcome these limitations, this work applies Vision Transformer (ViT) models, which have gained significant attention in computer vision for their ability to capture long-range dependencies in images through self-attention mechanisms. However, training large-scale ViT models with a massive number of parameters poses computational challenges. In this paper, we present an optimized approach for training ViT models that leverages the parallel processing capabilities of Graphics Processing Units (GPUs) and distributes the computational workload using multi-threading. The proposed model is trained and evaluated on the CIFAR-10 dataset, achieving an accuracy of 99.92% after 100 epochs. The experimental results demonstrate the effectiveness of our approach in improving training efficiency compared to existing methods. This underscores the strong performance of ViT models and their potential to advance image classification tasks.
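For illustration, the scaled dot-product self-attention that underlies ViT models can be sketched in plain Python. This is a minimal sketch only: it omits the learned query/key/value projection matrices, multiple heads, and positional embeddings that a real ViT uses, treating the input rows directly as queries, keys, and values.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over token vectors X (n tokens, d dims).

    Simplification: identity Q/K/V projections (a real ViT learns these).
    Every output token is a weighted mix of ALL input tokens, which is how
    attention captures long-range dependencies in a single layer.
    """
    n, d = len(X), len(X[0])
    # Pairwise similarity scores, scaled by sqrt(d).
    scores = [[sum(q * k for q, k in zip(X[i], X[j])) / math.sqrt(d)
               for j in range(n)] for i in range(n)]
    # Normalize each row of scores into attention weights.
    weights = [softmax(row) for row in scores]
    # Each output token is the attention-weighted average of all tokens.
    return [[sum(weights[i][j] * X[j][k] for j in range(n)) for k in range(d)]
            for i in range(n)]
```

Because every token attends to every other token, the receptive field is global from the first layer, unlike a CNN, where spatial context grows only with depth.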
External IDs: dblp:conf/ieeecai/LedetKRRS24