Keywords: long-range language modeling, retrieval-based language models, self-supervised learning
TL;DR: An efficient retrieval learning mechanism that enhances long-range language modeling capabilities.
Abstract: Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module that enables joint pre-training of the retriever and the causal LM, and apply it to long-context modeling. Given an input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation.
This design allows the retriever to learn, end-to-end, to retrieve the past chunks that best minimize the auto-regressive loss on subsequent tokens.
By integrating top-$k$ retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens.
Our experiments demonstrate that our model outperforms strong baselines on a variety of tasks and achieves 100\% accuracy on the needle-in-a-haystack (NIAH) test with a 16M-token context length.
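To make the described mechanism concrete, below is a minimal, illustrative sketch of chunked retrieval combined with cross-attention and a differentiable top-$k$ weighting. It assumes mean-pooled chunk summaries, a single cross-attention layer, and a softmax-weighted combination of retrieved chunks; these are stand-in choices, not the paper's exact Grouped Cross-Attention architecture.

```python
# Illustrative sketch only: chunk-level retrieval feeding cross-attention,
# with gradients flowing from the LM loss back into the retriever.
import torch
import torch.nn as nn


class ChunkRetrievalCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Retriever: scores past chunks against a summary of the current chunk.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Cross-attention from current-chunk tokens to retrieved-chunk tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, cur_chunk: torch.Tensor, past_chunks: torch.Tensor) -> torch.Tensor:
        # cur_chunk:   (B, L, D)    tokens of the current chunk
        # past_chunks: (B, N, L, D) tokens of N previously seen chunks
        B, N, L, D = past_chunks.shape

        # 1) Chunk-level retrieval scores (mean-pooled summaries as a stand-in).
        q = self.q_proj(cur_chunk.mean(dim=1))               # (B, D)
        k = self.k_proj(past_chunks.mean(dim=2))              # (B, N, D)
        scores = torch.einsum("bd,bnd->bn", q, k) / D ** 0.5  # (B, N)

        # 2) Top-k selection; softmax over the kept scores keeps the path
        #    differentiable so the LM loss can train the retriever end-to-end.
        top_scores, top_idx = scores.topk(min(self.top_k, N), dim=-1)  # (B, k)
        weights = top_scores.softmax(dim=-1)                           # (B, k)
        idx = top_idx[..., None, None].expand(-1, -1, L, D)
        retrieved = past_chunks.gather(1, idx)                         # (B, k, L, D)

        # 3) Cross-attend the current chunk to each retrieved chunk, then
        #    combine the per-chunk outputs with the retrieval weights.
        outs = []
        for j in range(retrieved.size(1)):
            out, _ = self.cross_attn(cur_chunk, retrieved[:, j], retrieved[:, j])
            outs.append(out)                                           # (B, L, D)
        outs = torch.stack(outs, dim=1)                                # (B, k, L, D)
        return (weights[..., None, None] * outs).sum(dim=1)           # (B, L, D)
```

In this sketch, the retrieval weights multiply the cross-attention outputs, so the gradient of the next-token loss reaches the chunk-scoring projections; any names and shapes above are hypothetical choices made for illustration.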
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 490