Abstract: The Transformer architecture has quickly become extremely popular, as it has achieved state-of-the-art performance on a variety of tasks with a relatively simple design of repeating blocks. Variants of Transformers are now staples in many classification tasks, including language modeling, image classification, and even object detection. The core aspect of the architecture, namely the sequentially repeated blocks, has however remained unchanged throughout this time. In this work, we explore an alternative, horizontally growing architecture that achieves similar results on the common tasks in which Transformers are proficient, while providing more controllability for parameter expansion due to the model’s shallow nature. We compare with two standard models: BERT for natural language processing and ViT for computer vision. We show that our model achieves comparable results while maintaining very low depth, in some cases with just a single layer. To the best of our knowledge, this is the first study that demonstrates the possibility and efficacy of such models. We provide results on standard benchmarks: MNLI for natural language inference, and CIFAR-10 and CIFAR-100 for image classification. We also provide all the source code for our experiments. The code can be found at https://github.com/akshaybadola/shallow-transformers.
External IDs:dblp:conf/miwai/BadolaPLN25