Keywords: large language models, self-attention, attention sink, representational collapse, positional discrimination, elastic-softmax
TL;DR: We introduce Lazy Attention, which combines head/dimension-wise positional discrimination with Elastic-Softmax to reduce attention overload and underload, producing more focused attention.
Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. Despite its power, however, the standard attention mechanism suffers from well-documented issues---representational collapse and attention sink. While existing work has proposed solutions, these problems are typically addressed in isolation, without a unified analysis of their root cause or a comprehensive remedy for both.
In this paper, we present a unified perspective, arguing that these seemingly disparate issues stem from a single underlying phenomenon: improper attention distribution. We identify two failure modes: 1) Attention Overload, where many tokens receive comparably high weights, blurring semantic features and leading to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention mass must still be distributed, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments demonstrate that Lazy Attention resolves attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
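The abstract does not specify the exact form of Elastic-Softmax; one plausible instantiation of "relaxing the standard softmax constraint" is to add a slack constant to the normalizer (in the spirit of off-by-one softmax), so that when all attention scores are low the resulting weights can sum to less than one instead of being forced onto irrelevant tokens. The function name and the `slack` parameter below are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def elastic_softmax(scores, slack=1.0):
    """Hypothetical sketch of a relaxed softmax.

    Computes exp(s_i) / (sum_j exp(s_j) + slack), so the weights need not
    sum to 1. When every score is strongly negative (attention underload),
    all weights shrink toward zero instead of being forced to distribute.
    NOTE: an assumed formulation, not necessarily the paper's Elastic-Softmax.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Shift by max(scores, 0) for numerical stability; the slack term is
    # rescaled by the same shift so the result is mathematically unchanged.
    m = np.maximum(scores.max(axis=-1, keepdims=True), 0.0)
    e = np.exp(scores - m)
    return e / (e.sum(axis=-1, keepdims=True) + slack * np.exp(-m))

# When no token is relevant, almost no attention is assigned:
w_underload = elastic_softmax([-10.0, -10.0, -10.0])
# When one token dominates, behavior approximates standard softmax:
w_focused = elastic_softmax([10.0, 0.0, 0.0])
```

Under this formulation, standard softmax is recovered as `slack -> 0`, and larger `slack` values suppress attention more aggressively when all scores are weak, which matches the stated goal of avoiding spurious focus such as attention sink.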
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4365