TL;DR: A novel attention mechanism with length-generalization ability achieves perfect accuracy on passkey retrieval at a 16M context length.
Abstract: Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention; extending the context typically requires post-training with a larger attention window, which significantly increases computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to $1000\times$ the pre-training context length while retaining access to distant information with a constant attention window size. We split a given input sequence into chunks, and each chunk retrieves the top-$k$ most relevant past chunks for subsequent text generation.
Specifically, unlike most previous works that rely on an off-the-shelf retriever, our key innovation is to let the retriever learn, in an end-to-end manner, to retrieve the past chunks that best minimize the auto-regressive loss of subsequent tokens, which adapts better to causal language models.
This mechanism attends to the retrieved chunks within a fixed-size attention window, enabling long-range information access while significantly reducing computational and memory costs during training and inference.
Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval at a 16M context length, which is $1000\times$ the training length.
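To make the chunk-retrieval-plus-cross-attention idea concrete, below is a minimal, hypothetical PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's actual implementation: the class name `GCASketch`, the `chunk_size` and `top_k` defaults, the mean-pooled chunk summaries used for retrieval scoring, and the gating of retrieved chunks by their softmaxed retrieval scores (one common way to make retrieval differentiable end-to-end). The real architecture and training objective are in the linked repository.

```python
import torch


class GCASketch(torch.nn.Module):
    """Chunk-level retrieval + cross-attention, loosely inspired by GCA (illustrative only)."""

    def __init__(self, dim: int, chunk_size: int = 64, top_k: int = 2):
        super().__init__()
        self.chunk_size = chunk_size
        self.top_k = top_k
        # Projections for chunk-level retrieval and token-level cross-attention.
        self.retr_q = torch.nn.Linear(dim, dim)
        self.retr_k = torch.nn.Linear(dim, dim)
        self.attn_q = torch.nn.Linear(dim, dim)
        self.attn_kv = torch.nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by chunk_size
        # and to span at least two chunks.
        b, t, d = x.shape
        n = t // self.chunk_size
        chunks = x.view(b, n, self.chunk_size, d)
        # Summarize each chunk (mean pooling is a simplification) and score
        # every pair of chunks for retrieval.
        summaries = chunks.mean(dim=2)
        scores = self.retr_q(summaries) @ self.retr_k(summaries).transpose(-1, -2)
        scores = scores / d ** 0.5                                    # (b, n, n)
        # Causality: a chunk may only retrieve strictly earlier chunks.
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device), -1)
        scores = scores.masked_fill(~causal, float("-inf"))
        k_eff = min(self.top_k, n - 1)
        top_scores, top_idx = scores.topk(k_eff, dim=-1)              # (b, n, k)
        # Gather the tokens of the retrieved chunks for each query chunk.
        batch_idx = torch.arange(b, device=x.device)[:, None]
        retrieved = chunks[batch_idx, top_idx.reshape(b, n * k_eff)]
        retrieved = retrieved.view(b, n, k_eff * self.chunk_size, d)
        # Cross-attention from the tokens of each chunk to its retrieved tokens;
        # the attention window is fixed at top_k * chunk_size regardless of seq_len.
        q = self.attn_q(chunks)                                       # (b, n, c, d)
        k, v = self.attn_kv(retrieved).chunk(2, dim=-1)               # (b, n, k*c, d)
        attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
        # Gate each retrieved chunk by its softmaxed retrieval score so gradients
        # flow back to the retriever -- one way to train retrieval end-to-end.
        gate = torch.softmax(top_scores, dim=-1)                      # (b, n, k)
        gate = gate.repeat_interleave(self.chunk_size, dim=-1).unsqueeze(2)
        gate = torch.nan_to_num(gate, nan=0.0)   # the first chunk has no past
        out = (attn * gate) @ v                                       # (b, n, c, d)
        return out.reshape(b, t, d)
```

In this sketch, multiplying the cross-attention output by the retrieval-score gate is what lets the auto-regressive loss supervise the retriever; because each chunk only ever attends to `top_k * chunk_size` retrieved tokens, the attention cost stays constant as the context grows.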
Lay Summary: Enabling machines to possess human-like long-term memory is a crucial step toward creating personalized assistants capable of recalling all relevant historical interactions, yet achieving this remains an unsolved challenge. If we equate ultra-long-term memory with infinite context, the key difficulty lies in how models extrapolate from a limited pre-training context to vastly longer ones while retaining random access to the entire context. This paper introduces an effective sparse attention mechanism that achieves 1000x length extrapolation, offering a promising prototype for enabling machines to develop permanent memory.
Link To Code: https://github.com/ant-research/long-context-modeling
Primary Area: Deep Learning->Attention Mechanisms
Keywords: long-context language modeling, retrieval-based LM, length generalization, attention mechanism
Submission Number: 2376