Keywords: reasoning, multi-hop, memory, large language models, generalization
TL;DR: A memory-augmented LLM shows robust multi-hop reasoning on a challenging long-context benchmark
Abstract: Recent benchmarks suggest that significant room remains to improve large language models’ ability to reason robustly across facts distributed in extremely long documents. In this work, we propose MemReasoner, a new memory-augmented LLM architecture that is trained to perform temporal reasoning, along with multiple computational steps, over the context stored in the memory. Experiments show that MemReasoner, trained on the core reasoning facts, generalizes better than off-the-shelf large language models and existing recurrent models to a test distribution in which the required facts are scattered across long natural text of up to 128k tokens. Further, MemReasoner demonstrates robust reasoning performance relative to the baselines when the answer distribution in the test samples differs from that in the training set.
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10910