Keywords: Transformer, Learning theory, Initialization, ConvMixer, Attention map
Abstract: The application of Vision Transformers (ViTs) to new domains where an inductive bias is known but only small datasets are available for training is a growing area of interest.
However, training ViT networks on small-scale datasets poses a significant challenge.
In contrast, Convolutional Neural Networks (CNNs) have an architectural inductive bias enabling them to perform well on such problems.
In this paper, we propose that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViTs.
Specifically, based on our theoretical finding that the convolutional structure of CNNs allows random impulse filters to achieve performance comparable to their learned counterparts, we design a ``structured initialization'' for ViTs, obtained via optimization.
Unlike conventional initialization methods for ViTs, which typically (1) rely on empirical observations such as attention weights from pretrained models, or (2) focus only on the distribution of the attention weights, resulting in unstructured attention maps, our approach is grounded in theoretical analysis and builds structured attention maps.
This key difference in the attention maps enables ViTs to perform as well as CNNs on small-scale problems while preserving their structural flexibility for large-scale applications.
We show that our method achieves significant performance improvements over conventional ViT initialization methods across numerous small-scale benchmarks including CIFAR-10, CIFAR-100, and SVHN, while maintaining comparable, if not better, performance on large-scale datasets such as ImageNet-1K.
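To make the idea of a ``structured initialization'' concrete, below is a minimal sketch (not the authors' code) of one way to build attention maps from random impulse filters as described in the abstract: each attention head attends to the patch at a fixed random spatial offset, mimicking convolution with a one-hot (impulse) kernel. All names and parameters here (grid_size, num_heads, kernel_size) are illustrative assumptions.

```python
# Hypothetical sketch: structured attention maps from random impulse filters.
import numpy as np

def impulse_attention_maps(grid_size=14, num_heads=8, kernel_size=3, seed=0):
    rng = np.random.default_rng(seed)
    n = grid_size * grid_size                     # number of patch tokens
    maps = np.zeros((num_heads, n, n))
    half = kernel_size // 2
    for h in range(num_heads):
        # Random impulse filter: a single "1" at a random offset in a k x k kernel.
        dy, dx = rng.integers(-half, half + 1, size=2)
        for y in range(grid_size):
            for x in range(grid_size):
                ty = min(max(y + dy, 0), grid_size - 1)   # clamp at image borders
                tx = min(max(x + dx, 0), grid_size - 1)
                maps[h, y * grid_size + x, ty * grid_size + tx] = 1.0
    return maps  # each row is a valid (one-hot) attention distribution

A = impulse_attention_maps()
assert np.allclose(A.sum(-1), 1.0)  # rows sum to 1, like softmax attention
```

Such maps are spatially structured (convolution-like) rather than drawn from an unstructured weight distribution; how the attention weights are actually optimized to realize them is described in the paper itself.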
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1680