Sieve Attention: Fusing Context-Aware Filtering and Sequential Allocation For Long Sequence

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: sparse attention, long-context, long-context generalization, long-context extrapolation, length generalization, length extrapolation
Abstract: Transformer-based language models struggle with long-context generalization, a problem often rooted in their attention mechanisms. Existing solutions face a trade-off: sparse attention mechanisms excel at identifying globally relevant content but are permutation-invariant and rely on brittle positional encodings, while sequential mechanisms are inherently order-aware but can be 'short-sighted', failing to attend to distant yet crucial information. To resolve this dichotomy, we propose Sieve Attention, a novel two-stage attention mechanism that unifies content-based filtering with sequential allocation. Sieve Attention first employs α-entmax to 'sieve' the entire context, selecting a small candidate set of content-relevant tokens. It then applies a sequential stick-breaking process exclusively to this pre-filtered set, allocating attention with an intrinsic recency bias and thereby eliminating the need for external positional encodings. We theoretically prove that this design overcomes the mutual limitations of its predecessors, providing both immunity to local distractors and inherent order sensitivity. Extensive experiments on long-context language modeling and retrieval benchmarks show that Sieve Attention significantly outperforms established baselines in length extrapolation and in-context learning. Our work presents a new path toward more robust long-context models by integrating global content analysis and local sequential reasoning directly within the attention mechanism.
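To make the two-stage idea concrete, the sketch below follows only the abstract's description: a sparse content filter over the whole context, followed by stick-breaking allocation restricted to the surviving tokens. It is not the authors' implementation; as simplifying assumptions, sparsemax (α-entmax with α = 2) stands in for general α-entmax, the same query-key scores drive both stages, and any leftover stick mass is simply dropped.

```python
# Minimal NumPy sketch of the Sieve Attention idea for a single query.
# Assumptions (not from the source): sparsemax replaces general alpha-entmax,
# one set of scores is reused for sieving and stick-breaking, and unallocated
# stick mass is discarded.
import numpy as np


def sparsemax(z: np.ndarray) -> np.ndarray:
    """Exact sparsemax (alpha-entmax with alpha = 2): projects scores onto the
    probability simplex, zeroing low-scoring entries so only a sparse
    candidate set survives."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    keep = 1 + k * z_sorted > cumsum              # prefix of entries in the support
    k_star = k[keep][-1]
    tau = (cumsum[keep][-1] - 1.0) / k_star       # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)


def stick_breaking(logits: np.ndarray) -> np.ndarray:
    """Sequential stick-breaking over tokens ordered oldest -> newest.
    More recent tokens break off their share of the stick first, which gives
    an intrinsic recency bias without any positional encoding."""
    betas = 1.0 / (1.0 + np.exp(-logits))         # per-token break fractions
    weights = np.zeros_like(logits)
    remaining = 1.0
    for i in range(logits.size - 1, -1, -1):      # newest candidate first
        weights[i] = betas[i] * remaining
        remaining *= 1.0 - betas[i]
    return weights                                # leftover mass is dropped here


def sieve_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Stage 1: sieve the full context for content-relevant tokens.
    Stage 2: allocate attention sequentially over the surviving candidates."""
    scores = K @ q / np.sqrt(q.size)              # scaled dot-product scores
    sieve = sparsemax(scores)                     # sparse content-based filter
    candidates = np.nonzero(sieve)[0]             # surviving tokens, in sequence order
    weights = stick_breaking(scores[candidates])  # recency-biased allocation
    return weights @ V[candidates]                # attend only to the sieved set


# Toy usage: one 4-dim query over an 8-token context.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(sieve_attention(q, K, V))
```

Because the stick is broken from the newest surviving token backwards, two candidates with identical content receive different weights depending on their order, which illustrates the order sensitivity the abstract claims for the sequential stage.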
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7177