Keywords: vision transformers, CIFAR-10, optimization dynamics, inductive bias, generalization, controlled experiments, training dynamics, small-data learning
TL;DR: Controlled experiments on CIFAR-10 show that optimization strategies, rather than architectural scaling, are the primary driver of Vision Transformer generalization in small-data regimes.
Abstract: Vision Transformers (ViTs) perform competitively on large-scale vision benchmarks but consistently underperform convolutional models when trained from scratch on small datasets. We present a controlled empirical study of ViTs trained from scratch on CIFAR-10, systematically isolating the effects of data diversity, model capacity, regularization, and optimization. Across four progressively refined ViT variants, we find that architectural scaling and data augmentation yield limited gains, whereas optimization strategies (specifically learning-rate warmup and cosine decay, combined with stronger regularization) produce substantial improvements in generalization. Our results indicate that ViT failure in small-data regimes is governed primarily by optimization dynamics rather than by architectural limitations.
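For readers unfamiliar with the schedule named in the abstract, learning-rate warmup followed by cosine decay can be sketched as below. This is a minimal illustration only: the function name `lr_at_step`, its parameters, and the numeric values in the usage note are assumptions chosen for the example, not settings reported in the submission.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup followed by cosine decay (hypothetical helper)."""
    if step < warmup_steps:
        # Warmup: ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With illustrative values such as `base_lr=1e-3`, `warmup_steps=10`, and `total_steps=100`, the rate rises linearly over the first 10 steps, peaks at `base_lr`, and then follows a half-cosine down to `min_lr`.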
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 20