Transformer Instability in Long Sequence Training: The Underestimated Role of Short-Range Dependencies

17 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Transformers, Self-Attention, Training Instability, Long-Sequence Training, Short-Range Dependencies
Abstract: Transformer language models have driven remarkable progress across diverse fields, including natural language processing, speech processing, and computer vision. However, despite extensive research, transformers remain prone to training instability on long sequences, often manifesting as sudden spikes or divergence in the training loss during a run. In this work, we identify a source of this instability: self-attention's limited capacity to capture short-range dependencies, particularly in tasks such as language modeling, where most tokens depend heavily on their immediate neighbors. This limitation leads to rapid growth of the self-attention logits during long-sequence training, ultimately destabilizing optimization. To address this, we propose augmenting the standard architecture with several local (short-range) attention heads alongside the full (long-range) attention heads. The local heads explicitly capture short-range dependencies, while the full heads preserve long-range context. This composed self-attention, termed Long Short-attention (LS-attention), stabilizes training by mitigating logit explosion. Across a wide range of experiments, we demonstrate that long-sequence training triggers logit explosion for multi-head self-attention (MHSA), whereas LS-attention effectively prevents it. Additionally, LS-attention makes transformer models more efficient, reducing inference latency by up to 44% compared to equivalent state-of-the-art MHSA implementations.
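The head layout described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`causal_mask`, `attention_head`, `ls_attention`), the sliding-window form of the local heads, the causal masking, the head ordering, and the concatenation of head outputs are all assumptions made for illustration, and the output projection is omitted.

```python
import math


def causal_mask(seq_len, window=None):
    """mask[i][j] is True where query i may attend to key j.

    window=None gives full causal attention; window=w restricts each
    query to the w most recent positions, itself included (assumed form
    of a "local" head).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v, mask):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j in range(len(k)) if mask[i][j]]
        logits = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in allowed
        ]
        weights = softmax(logits)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, allowed))
            for c in range(len(v[0]))
        ])
    return out


def ls_attention(heads, num_local, window):
    """Hypothetical LS-attention layout.

    heads is a list of (q, k, v) tuples. The first num_local heads use the
    local window mask, the remaining heads use full causal attention, and
    the per-position outputs are concatenated across heads.
    """
    seq_len = len(heads[0][0])
    local = causal_mask(seq_len, window)
    full = causal_mask(seq_len)
    outs = [
        attention_head(q, k, v, local if h < num_local else full)
        for h, (q, k, v) in enumerate(heads)
    ]
    return [sum((o[i] for o in outs), []) for i in range(seq_len)]
```

With `num_local=0` the sketch reduces to ordinary causal multi-head attention, so the only difference between the two variants here is which mask each head receives.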
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9127