Keywords: multi-head attention, theoretical approach to efficient transformers
Abstract: Transformers have reshaped machine learning by leveraging attention to capture complex dependencies, driving major advances across domains. Their success has fueled the belief that ever-larger models are required for strong performance. In this paper, we challenge this assumption by showing that many transformers are unnecessarily oversized. We present a theoretical principle that redefines the role of multi-head attention, demonstrating that multiple heads improve the conditioning of the Jacobian of the attention block. Guided by this insight, we redesign popular architectures with more heads and fewer layers. This trade-off reduces parameter counts by up to 30–50% while preserving accuracy, yielding leaner yet equally effective models. We validate our approach across a range of transformer-based architectures and scales, showing consistent benefits on tasks in computer vision (ImageNet-1k) and language and sequence modeling (GLUE, TinyStories, and the Long-Range Arena benchmark).
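As a rough illustration of the heads-versus-layers trade-off described in the abstract, the sketch below counts trunk parameters for a standard transformer under the usual convention head_dim = d_model / n_heads, where adding heads is essentially free in parameters while removing layers shrinks the model. The specific widths, layer counts, and function names are illustrative assumptions, not the configurations or code evaluated in the paper.

```python
# A minimal sketch (not the paper's code) of the parameter accounting behind
# the "more heads, fewer layers" trade-off. With head_dim = d_model / n_heads,
# the Q/K/V/output projections have the same total size for any head count,
# so extra heads cost (almost) nothing, while dropped layers save parameters.

def layer_params(d_model, d_ff=None):
    """Approximate parameter count of one standard transformer layer."""
    if d_ff is None:
        d_ff = 4 * d_model                       # common feed-forward width
    attn = 4 * d_model * d_model + 4 * d_model   # Q, K, V, O weights + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model    # two linear maps + biases
    norms = 2 * 2 * d_model                      # two LayerNorms (scale, shift)
    return attn + ffn + norms

def trunk_params(n_layers, d_model):
    """Transformer trunk only; embeddings and output head are excluded."""
    return n_layers * layer_params(d_model)

# Hypothetical configurations: same width, fewer layers, with the head count
# per layer increased instead (head count does not change the totals below).
baseline = trunk_params(n_layers=12, d_model=768)   # a BERT-base-like trunk
shallower = trunk_params(n_layers=8, d_model=768)   # fewer layers, more heads each
saving = 100.0 * (1.0 - shallower / baseline)
print("baseline: %.1fM, shallower: %.1fM (%.0f%% fewer trunk parameters)"
      % (baseline / 1e6, shallower / 1e6, saving))
```

Under this accounting the parameter reduction comes entirely from the removed layers; whether accuracy is preserved when those layers are traded for additional heads is what the paper's Jacobian-conditioning argument and experiments address.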
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 17693