Optimal patch partitioning in Vision Transformers: Reassessing the notion of model size superiority

24 Nov 2023 (modified: 21 Feb 2024) · Submitted to SAI-AAAI2024 · CC BY 4.0
Keywords: Dynamic inference, neural network compression, Efficient Transformers
Abstract: Recent advances in Vision Transformers (ViT) have demonstrated notable efficacy in large-scale image recognition by splitting 2D images into a fixed number of patches, treating each patch as a distinct token. Generally, increasing the number of tokens used to represent an image improves prediction accuracy, albeit at the expense of heightened computational demands due to the quadratic complexity of these models. Therefore, to strike a judicious balance between accuracy and computational efficiency, conventional practice involves empirically setting the token count to values such as $14\times14$. This study contends that the optimal token count depends on the inherent characteristics of each individual image. Our empirical investigations show that adapting the patch partitioning to each image's "hardness" yields a better operating point in the accuracy vs. complexity tradeoff. Consequently, we advocate a dynamic adjustment of the token count based on the unique attributes of each input image. In our experimental study on the ImageNet-1K dataset, we observed a notable phenomenon: a smaller $7\times7$ Transformer model outperforms a larger $14\times14$ counterpart, excelling not only in computational efficiency (FLOPs) but also in top-1 accuracy. This result challenges conventional assumptions regarding the relationship between model size and performance, prompting a reconsideration of scalability in image recognition tasks.
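To make the patch-partitioning tradeoff concrete, the following is a minimal sketch (not the paper's implementation) of how a standard $224\times224$ image yields $14\times14$ tokens with patch size 16 versus $7\times7$ tokens with patch size 32; the function name and NumPy-based layout are illustrative assumptions.

```python
import numpy as np

def partition_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches,
    returning an (N, patch_size * patch_size * C) token matrix,
    where N is the resulting token count."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Rearrange into a (grid_h, grid_w) grid of patches, then flatten each patch.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return patches

img = np.random.rand(224, 224, 3)
print(partition_into_patches(img, 16).shape)  # (196, 768)  -> 14x14 tokens
print(partition_into_patches(img, 32).shape)  # (49, 3072)  -> 7x7 tokens
```

Because self-attention cost grows quadratically in the token count, dropping from 196 to 49 tokens reduces the attention FLOPs by roughly a factor of 16, which is the lever the abstract's per-image adaptive partitioning exploits.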
Submission Number: 9