TL;DR: We propose a minimal transformer with only self-attention and skip connections.
Abstract: Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as many empirical studies have indicated, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only the necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, the associated denoising operation naturally takes the form of a multi-head (subspace) self-attention operator. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of \textit{only} self-attention operators with skip connections at each layer. Furthermore, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations \textit{at a linear rate} with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
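To make the described architecture concrete, below is a minimal PyTorch sketch of a layer that applies only self-attention plus a skip connection, unrolled into a deep stack. The class names, hyperparameters, and the use of PyTorch's standard nn.MultiheadAttention in place of the paper's multi-head (subspace) self-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' released code) of an
# attention-only transformer: each layer updates token representations as
# X <- X + MHSA(X), with no feedforward sub-block, and the unrolled stack
# of such layers forms the deep network described in the abstract.

import torch
import torch.nn as nn


class AttentionOnlyBlock(nn.Module):
    """One layer: tokens X are updated as X + MHSA(X) (skip connection)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Standard multi-head self-attention stands in for the paper's
        # multi-head (subspace) self-attention, whose head parameterization
        # may differ; this substitution is an assumption of the sketch.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return x + attn_out  # skip connection: one incremental denoising step


class AttentionOnlyTransformer(nn.Module):
    """Stack of attention-only layers, i.e. the unrolled denoising iteration."""

    def __init__(self, dim: int = 256, num_heads: int = 8, depth: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(
            [AttentionOnlyBlock(dim, num_heads) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)  # each layer is one denoising iteration
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 256)  # (batch, sequence length, embedding dim)
    model = AttentionOnlyTransformer()
    print(model(tokens).shape)  # torch.Size([2, 16, 256])
```

The sketch keeps each layer strictly to a self-attention operator and a residual addition, mirroring the claim that no feedforward or other sub-blocks are needed; practical details such as normalization, tokenization, and task heads are omitted here.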
Lay Summary: Transformers power many AI systems today, but their internal structure is complex and often built without clear explanations for why each part is needed. In this work, we take a step toward making transformers simpler and more understandable. We show that a core reason transformers work well is that they denoise noisy token representations towards their corresponding low-dimensional subspaces. Based on this idea, we design a streamlined version of the transformer that uses only attention and skip connections, removing other common components such as feedforward layers. Despite being much simpler, our model performs nearly as well as standard transformers on tasks in both language and vision, offering new insights into how these powerful models actually work.
Primary Area: Optimization->Non-Convex
Keywords: transformer, attention, subspace denoising, token representation
Submission Number: 3314