Keywords: NLP, Large language models, long sequence
Abstract: The quadratic complexity of standard attention poses a significant bottleneck for Large Language Models (LLMs) in processing long sequences and integrating information across distant contexts. To overcome this limitation, we propose Summary Token-guided Attention and Routing (STAR), a novel and efficient attention mechanism that selectively retrieves and refines relevant context through a three-stage, coarse-to-fine process. In the Intra-Chunk Abstraction stage, a special summary token appended to each chunk captures local semantics via attention. In the Inter-Chunk Routing stage, a query attends to all summary tokens to identify the most relevant chunks. Finally, the Token-Level Refinement stage applies fine-grained attention over the original tokens within those chunks to enrich the contextual representation. Compared to global dense attention, STAR significantly reduces computational cost as input length grows, while preserving the model’s ability to reason over long-range dependencies. Experiments on challenging long-context benchmarks show that STAR consistently outperforms existing approaches for enhancing the long-text processing capabilities of LLMs.
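The three stages described in the abstract can be illustrated with a minimal single-query, single-head NumPy sketch. This is not the authors' implementation: the abstract does not specify how summary tokens are parameterized, how many chunks are selected, or how routing scores are computed, so `star_sketch`, `summary_init`, `chunk_size`, `top_k`, the dot-product routing score, and the hard top-k selection are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention with q: (m, d) and k, v: (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def star_sketch(query, tokens, summary_init, chunk_size, top_k):
    """Hypothetical coarse-to-fine attention for one query vector.

    query: (d,), tokens: (n, d), summary_init: (d,) shared summary-token
    embedding (an assumption; the abstract does not say how it is set up).
    """
    n, d = tokens.shape
    assert n % chunk_size == 0, "this sketch assumes n is a multiple of chunk_size"
    chunks = tokens.reshape(n // chunk_size, chunk_size, d)

    # Stage 1 (Intra-Chunk Abstraction): the summary token attends over
    # the tokens of its own chunk, giving one summary vector per chunk.
    summaries = np.stack([attend(summary_init[None, :], c, c)[0] for c in chunks])

    # Stage 2 (Inter-Chunk Routing): score every summary against the query
    # and keep the top_k chunks (hard top-k is an assumed routing rule).
    route_scores = summaries @ query / np.sqrt(d)
    selected = np.sort(np.argsort(route_scores)[-top_k:])

    # Stage 3 (Token-Level Refinement): fine-grained attention over the
    # original tokens of the selected chunks only.
    picked = chunks[selected].reshape(-1, d)
    out = attend(query[None, :], picked, picked)[0]
    return out, selected
```

Under these assumptions, each query scores `n / chunk_size` summary tokens in the routing stage and `top_k * chunk_size` original tokens in the refinement stage, rather than all `n` tokens as in dense attention.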
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10621