Implementing Vision Transformers from Scratch on Any Dataset!

MICCAI 2024 MEC Submission 3 Authors

08 Aug 2024 (modified: 18 Aug 2024) · MICCAI 2024 MEC Submission · License: CC BY 4.0
Keywords: medical imaging, computer vision, deep learning
TL;DR: Implement Vision Transformers using PyTorch from the ground-up on any dataset of your choice.
Abstract: Vision Transformers (ViTs), introduced in 2020, provide an innovative approach to image classification. They are inspired by the Transformer architecture, which is widely used in natural language processing. ViTs leverage multi-head self-attention mechanisms to compete effectively with convolutional neural networks (CNNs), the de facto standard in image recognition. Several implementations of ViTs have demonstrated strong performance, often relying on external libraries such as Hugging Face Transformers or Keras. These libraries simplify the process of implementing and fine-tuning models and are invaluable for developers and engineers because they streamline model deployment and customization. However, they may not be ideal for students beginning their deep learning journey, as they reduce the architectural transparency of the model. For learners, implementing state-of-the-art (SOTA) models from scratch is essential for gaining a deeper understanding of the architecture and its mechanics. This repository and the accompanying notebook provide a clear and comprehensive tutorial for students to construct ViTs using PyTorch from the ground up, allowing them to apply the model to any dataset of their choice.
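To illustrate the kind of from-scratch construction the tutorial walks through, here is a minimal PyTorch sketch (not taken from the linked repository; all class and parameter names are illustrative) of the core ViT pipeline the abstract describes: split an image into patches, embed them, prepend a class token, add positional embeddings, and pass the sequence through a multi-head self-attention encoder before classifying from the class token.

```python
# Minimal ViT sketch for illustration only; the accompanying notebook builds
# the attention and encoder blocks manually rather than using nn.TransformerEncoder.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution maps each patch to a vector.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Classify from the class-token representation.
        return self.head(x[:, 0])

model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```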
Video: https://www.youtube.com/watch?v=bLJiCiQMAwU
Website: https://github.com/ssanya942/MICCAI-Educational-Challenge-2024, https://colab.research.google.com/github/ssanya942/MICCAI-Educational-Challenge-2024/blob/master/Implementing_Vision_Transformers_in_PyTorch_from_Scratch.ipynb
Submission Number: 3
