ATLAS: Learning to Optimally Memorize the Context at Test Time

Published: 29 May 2025 · Last Modified: 22 Sept 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Transformers are popular sequence-modeling backbones, effective for in-context retrieval tasks and scalable learning. However, their quadratic memory and time complexity limits their applicability to longer sequences, motivating the exploration of modern recurrent neural networks (RNNs) as alternatives. Despite recent successes, these RNNs struggle with long-context understanding and extrapolation. We observe that these shortcomings stem from three design aspects: (1) limited memory capacity, bounded by the memory architecture and the feature mapping of the input; (2) the online nature of the update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To improve all three aspects, we present Atlas, a new high-capacity long-term memory module that learns to memorize historical context by optimizing the memory based on the current and past tokens, overcoming the online nature of existing long-term memory models. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that Atlas surpasses the performance of Transformers and recent modern linear recurrent models. We further evaluate the effect of our solutions on the memory capacity and memory management of recurrent models, providing insight into the potential reasons for their limited effective context length.
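To make the distinction between "online" memory updates and the context-aware updates the abstract describes more concrete, here is a minimal sketch in plain NumPy. It assumes a simple matrix-valued memory M queried as M @ k and trained with gradient descent on a squared associative-recall loss; the function names, window size, and loss are illustrative assumptions for exposition, not the paper's actual update rule or implementation.

```python
import numpy as np

def online_update(M, k, v, lr=0.1):
    """Online-style update (illustrative): one gradient step on the loss
    ||M @ k - v||^2 for the *latest* key/value pair only."""
    grad = 2.0 * np.outer(M @ k - v, k)     # d/dM of the squared recall loss
    return M - lr * grad

def windowed_update(M, keys, values, lr=0.1):
    """Window-based update (illustrative): one gradient step on the summed
    loss over the current token *and* a window of past tokens, sketching the
    idea of optimizing memory with respect to context rather than one input."""
    grad = np.zeros_like(M)
    for k, v in zip(keys, values):
        grad += 2.0 * np.outer(M @ k - v, k)
    return M - lr * grad / len(keys)        # average over the window

# Toy usage: a d_v x d_k matrix memory, updated over a sliding context window.
d_k, d_v, window = 8, 8, 4                  # hypothetical sizes
rng = np.random.default_rng(0)
M = np.zeros((d_v, d_k))
ks = rng.normal(size=(16, d_k))
vs = rng.normal(size=(16, d_v))
for t in range(len(ks)):
    lo = max(0, t - window + 1)             # current token plus recent past
    M = windowed_update(M, ks[lo:t + 1], vs[lo:t + 1])
```

In this toy setting, the online variant only ever sees the most recent pair, while the windowed variant spreads each step over several recent tokens, which is the contrast the abstract draws between purely online recurrent memories and Atlas-style context-aware memorization.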