Swelling-ViT: Rethink Data-Efficient Vision Transformer from Locality.

Published: 06 Nov 2024, Last Modified: 22 Jan 2026. OpenReview Archive Direct Upload. License: CC BY 4.0.
Abstract: In the domain of computer vision, Transformers have shown great promise, yet they face difficulties when trained from scratch on small datasets, often underperforming convolutional neural networks (ConvNets). Our work highlights that Vision Transformers (ViTs) suffer from unfocused attention when trained on limited data. This insight has catalyzed the development of our Swelling ViT framework, an adaptive training strategy that initializes the ViT with a local attention window and gradually expands it during training. This approach lets the model learn local features more easily, thereby mitigating the attention dispersion phenomenon. Our empirical evaluation of Swelling ViT-B on the CIFAR-100 dataset yields remarkable results, achieving 82.60% accuracy after 300 epochs from scratch and improving to 83.31% after 900 epochs of training. These outcomes not only represent state-of-the-art performance but also underscore Swelling ViT's capability to effectively address the attention dispersion issue, particularly on small datasets. Moreover, the robustness of Swelling ViT is affirmed by its consistent performance on the extensive ImageNet dataset, confirming that the strategy does not compromise effectiveness when scaled to larger data regimes. This work therefore not only bridges the gap in data efficiency for ViT models but also introduces a versatile solution that can be readily adapted to various domains, regardless of data availability.
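The mechanism the abstract describes, restricting each patch to a local attention window that widens as training progresses, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the linear expansion schedule, and the window bounds are all assumptions for exposition.

```python
import numpy as np

def local_attention_mask(grid: int, window: int) -> np.ndarray:
    """Boolean (N, N) mask over an N = grid*grid patch sequence:
    patch i may attend to patch j iff their grid coordinates differ
    by at most `window` along each axis (a square local window)."""
    coords = np.array([(r, c) for r in range(grid) for c in range(grid)])
    diff = np.abs(coords[:, None, :] - coords[None, :, :])  # (N, N, 2)
    return (diff <= window).all(axis=-1)

def swelling_window(epoch: int, total_epochs: int,
                    start: int = 1, end: int = 7) -> int:
    """Illustrative schedule: linearly expand the window radius from
    `start` to `end` over training. The paper's actual schedule and
    bounds may differ; this only conveys the 'swelling' idea."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return int(round(start + frac * (end - start)))
```

In an actual ViT, the mask would be recomputed at the start of each epoch from `swelling_window(...)` and added (as `-inf` on disallowed pairs) to the attention logits before the softmax; once the window covers the whole grid, the model recovers standard global self-attention.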