TL;DR: A novel attention mechanism with length-generalization ability achieves perfect accuracy on passkey retrieval at a 16M context length.
Abstract: Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention; extending the context typically requires post-training with a larger attention window, which significantly increases computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to $1000\times$ the pre-training context length while retaining access to distant information with a constant attention window size. We split a given input sequence into chunks, and each chunk retrieves the top-$k$ most relevant past chunks for subsequent text generation.
Specifically, unlike most previous works that rely on an off-the-shelf retriever, our key innovation is to let the retriever learn, in an end-to-end manner, to retrieve the past chunks that best minimize the auto-regressive loss of subsequent tokens, which adapts better to causal language models.
This mechanism attends to the retrieved chunks within a fixed-size attention window, enabling long-range information access while significantly reducing computational and memory costs during training and inference.
Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval at a 16M context length, which is $1000\times$ the training length.
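To make the chunk-retrieval-plus-cross-attention idea concrete, below is a minimal, hypothetical PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's actual implementation: the class name `GCASketch`, the `chunk_size` and `top_k` defaults, the mean-pooled chunk summaries used for retrieval scoring, and the gating of retrieved chunks by their softmaxed retrieval scores (one common way to make retrieval differentiable end-to-end). The real architecture and training objective are in the linked repository.

```python
import torch


class GCASketch(torch.nn.Module):
    """Chunk-level retrieval + cross-attention, loosely inspired by GCA (illustrative only)."""

    def __init__(self, dim: int, chunk_size: int = 64, top_k: int = 2):
        super().__init__()
        self.chunk_size = chunk_size
        self.top_k = top_k
        # Projections for chunk-level retrieval and token-level cross-attention.
        self.retr_q = torch.nn.Linear(dim, dim)
        self.retr_k = torch.nn.Linear(dim, dim)
        self.attn_q = torch.nn.Linear(dim, dim)
        self.attn_kv = torch.nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by chunk_size
        # and to span at least two chunks.
        b, t, d = x.shape
        n = t // self.chunk_size
        chunks = x.view(b, n, self.chunk_size, d)
        # Summarize each chunk (mean pooling is a simplification) and score
        # every pair of chunks for retrieval.
        summaries = chunks.mean(dim=2)
        scores = self.retr_q(summaries) @ self.retr_k(summaries).transpose(-1, -2)
        scores = scores / d ** 0.5                                    # (b, n, n)
        # Causality: a chunk may only retrieve strictly earlier chunks.
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device), -1)
        scores = scores.masked_fill(~causal, float("-inf"))
        k_eff = min(self.top_k, n - 1)
        top_scores, top_idx = scores.topk(k_eff, dim=-1)              # (b, n, k)
        # Gather the tokens of the retrieved chunks for each query chunk.
        batch_idx = torch.arange(b, device=x.device)[:, None]
        retrieved = chunks[batch_idx, top_idx.reshape(b, n * k_eff)]
        retrieved = retrieved.view(b, n, k_eff * self.chunk_size, d)
        # Cross-attention from the tokens of each chunk to its retrieved tokens;
        # the attention window is fixed at top_k * chunk_size regardless of seq_len.
        q = self.attn_q(chunks)                                       # (b, n, c, d)
        k, v = self.attn_kv(retrieved).chunk(2, dim=-1)               # (b, n, k*c, d)
        attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
        # Gate each retrieved chunk by its softmaxed retrieval score so gradients
        # flow back to the retriever -- one way to train retrieval end-to-end.
        gate = torch.softmax(top_scores, dim=-1)                      # (b, n, k)
        gate = gate.repeat_interleave(self.chunk_size, dim=-1).unsqueeze(2)
        gate = torch.nan_to_num(gate, nan=0.0)   # the first chunk has no past
        out = (attn * gate) @ v                                       # (b, n, c, d)
        return out.reshape(b, t, d)
```

In this sketch, multiplying the cross-attention output by the retrieval-score gate is what lets the auto-regressive loss supervise the retriever; because each chunk only ever attends to `top_k * chunk_size` retrieved tokens, the attention cost stays constant as the context grows.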
Lay Summary: Enabling machines to possess human-like long-term memory is a crucial step toward creating personalized assistants capable of recalling all relevant historical interactions, yet achieving this remains an unsolved challenge. If we equate ultra-long-term memory with infinite context, the key difficulty lies in how models extrapolate from a limited pre-training context to vastly longer ones while retaining random access to the entire context. This paper introduces an effective sparse attention mechanism that achieves 1000x length extrapolation, offering a promising prototype for enabling machines to develop permanent memory.
Link To Code: https://github.com/ant-research/long-context-modeling
Primary Area: Deep Learning->Attention Mechanisms
Keywords: long-context language modeling, retrieval-based LM, length generalization, attention mechanism
Submission Number: 2376