Curse of High Dimensionality Issue in Transformer for Long Context Modeling

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.
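To make the grouping idea concrete, below is a minimal, hypothetical sketch of attention with token grouping; it is not the authors' DGA implementation. The function name `dynamic_group_attention_sketch`, the key-norm importance proxy, and the contiguous mean-pooled grouping are illustrative assumptions standing in for the paper's actual selection and group coding strategy.

```python
# Hypothetical sketch (not the authors' released code): single-head attention where
# the most important keys are kept individually and the remaining keys/values are
# mean-pooled into a small number of group representatives before attention.
import torch
import torch.nn.functional as F

def dynamic_group_attention_sketch(q, k, v, keep=64, num_groups=8):
    """
    q: (n_q, d), k/v: (n_kv, d). Returns (n_q, d).
    Importance is approximated here by key L2 norm; the paper's actual
    criterion and grouping strategy may differ.
    """
    n_kv, d = k.shape
    keep = min(keep, n_kv)

    # Rank tokens by a cheap importance proxy.
    importance = k.norm(dim=-1)                      # (n_kv,)
    top_idx = importance.topk(keep).indices          # tokens kept as-is
    mask = torch.ones(n_kv, dtype=torch.bool)
    mask[top_idx] = False
    rest_idx = mask.nonzero(as_tuple=True)[0]        # less important tokens

    k_keep, v_keep = k[top_idx], v[top_idx]

    if rest_idx.numel() > 0:
        # Aggregate the remaining tokens into a few mean-pooled groups.
        groups = torch.chunk(rest_idx, min(num_groups, rest_idx.numel()))
        k_grp = torch.stack([k[g].mean(dim=0) for g in groups])
        v_grp = torch.stack([v[g].mean(dim=0) for g in groups])
        k_red = torch.cat([k_keep, k_grp], dim=0)
        v_red = torch.cat([v_keep, v_grp], dim=0)
    else:
        k_red, v_red = k_keep, v_keep

    # Standard scaled dot-product attention over the reduced key/value set.
    scores = q @ k_red.T / d ** 0.5
    return F.softmax(scores, dim=-1) @ v_red

# Example: 4096 key/value tokens reduced to 64 kept tokens + 8 group tokens.
q = torch.randn(16, 128); k = torch.randn(4096, 128); v = torch.randn(4096, 128)
out = dynamic_group_attention_sketch(q, k, v)
print(out.shape)  # torch.Size([16, 128])
```

In this sketch the attention cost drops from O(n_q · n_kv) to O(n_q · (keep + num_groups)), which mirrors the abstract's claim that aggregating less important tokens reduces redundant computation while keeping the important tokens' contributions intact.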
Lay Summary: Modern large language models often struggle to process long texts efficiently because they treat each word as equally important, even though many words, like repeated phrases or filler text, contribute little to the final result. This inefficiency wastes computational resources, slows down tasks, and limits real-world applications in areas where long-text analysis is critical. This work shows theoretically that only a few key words actually influence the final understanding. To address this, we propose Dynamic Group Attention (DGA), which dynamically identifies and prioritizes the most important words while grouping less critical ones. Our DGA method reduces computational costs while maintaining accuracy, making it feasible to deploy powerful language tools in resource-limited settings like small labs or rural healthcare systems.
Primary Area: Theory->Deep Learning
Keywords: Large language models, Transformer
Submission Number: 2144