Vertical Attention: Automatic Exploration of Inter-Layer Connections in Transformer-based Language Models
Keywords: Layerwise attention, Inter-layer connection, Transformer
TL;DR: This work proposes a method to automatically learn inter-layer connectivity in Transformers by adding attention-based routing modules at each layer.
Abstract: The Transformer architecture has become the de facto standard across natural language processing and other modalities, demonstrating strong generality and performance. However, the conventional design, which stacks attention and feed-forward blocks sequentially, is not guaranteed to be optimal. More expressive inter-layer connectivity patterns, such as parallelization or skip connections across distant layers, may exist but are difficult to discover through manual exploration. In this work, we propose a method to automatically learn inter-layer network paths during training. Our approach introduces a small number of parameterized attention modules at the beginning of each layer, which are interpreted as inter-layer connections, and optimizes these paths end-to-end. Through large-scale experiments with LLaMA-style models ranging from 50M to 300M parameters pre-trained on 20B tokens, we show that our method consistently achieves lower pretraining loss than vanilla Transformers and competitive baselines. Analysis of the learned attention maps reveals intriguing patterns, such as strong interactions from lower to higher layers and attention sparsity in the middle layers. Furthermore, logit lens analysis demonstrates that our Transformer almost entirely postpones output prediction until the final layer, exhibiting fundamentally different internal behavior from that of a vanilla Transformer. Finally, we validate the effectiveness of the proposed architecture on downstream tasks in a few-shot in-context learning setting, confirming its applicability and utility.
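To make the core idea concrete, below is a minimal NumPy sketch of one plausible form of the routing module described in the abstract: at each layer, a small attention module scores the outputs of all preceding layers and mixes them into that layer's input, so the learned attention weights act as soft inter-layer connections. The abstract does not specify the exact parameterization, so the function and parameter names (`vertical_attention`, `W_q`, `W_k`) and the query/key design are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vertical_attention(layer_outputs, W_q, W_k):
    """Hypothetical inter-layer routing sketch (not the paper's exact module).

    layer_outputs: list of (seq_len, d) arrays, the hidden states h_0 .. h_{i-1}
                   produced by all layers before layer i.
    W_q, W_k:      (d, d_k) learned projections (assumed parameter names).
    Returns a (seq_len, d) mixture of previous-layer states, used as the
    input to layer i; the softmax weights play the role of learned
    inter-layer connections.
    """
    H = np.stack(layer_outputs)                 # (L, seq_len, d)
    q = H[-1] @ W_q                             # query from the latest state: (seq_len, d_k)
    k = H @ W_k                                 # one key per layer per token: (L, seq_len, d_k)
    # Score each preceding layer independently at every token position.
    scores = np.einsum('sd,lsd->sl', q, k) / np.sqrt(W_q.shape[1])
    w = softmax(scores, axis=-1)                # (seq_len, L) routing weights over layers
    return np.einsum('sl,lsd->sd', w, H)        # weighted mix of layer states
```

In a full model, one such module would sit at the start of each Transformer layer and be trained end-to-end with the rest of the network; inspecting `w` across layers would then yield the kind of attention maps the abstract analyzes (e.g., lower-to-higher layer interactions, mid-layer sparsity).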
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24363