NExtLong: Toward Effective Long-Context Training without Long Documents

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large language models (LLMs) with extended context windows have made significant strides, yet training them remains challenging due to the scarcity of long documents. Existing methods synthesize long-context data but lack a clear mechanism for reinforcing long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks over existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
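To make the synthesis procedure concrete, below is a minimal sketch of negative document extension as described in the abstract: split a document into meta-chunks, retrieve hard negative distractors from a pool of short pretraining texts, and interleave them. The function name `synthesize_long_context`, the whitespace-based chunking, the TF-IDF retriever (a stand-in for whatever retriever the paper actually uses), and the number of distractors per meta-chunk are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# Hedged sketch of NExtLong-style negative document extension.
# Assumptions: chunking by whitespace token count, TF-IDF similarity as a
# stand-in retriever, and 4 distractors per meta-chunk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def synthesize_long_context(document: str,
                            corpus_chunks: list[str],
                            chunk_tokens: int = 128,
                            negatives_per_chunk: int = 4) -> str:
    """Decompose `document` into meta-chunks and interleave hard negative
    distractors retrieved from `corpus_chunks` (short pretraining texts)."""
    # 1. Decompose the document into meta-chunks (whitespace tokens as a proxy
    #    for model tokens).
    tokens = document.split()
    meta_chunks = [" ".join(tokens[i:i + chunk_tokens])
                   for i in range(0, len(tokens), chunk_tokens)]

    # 2. Fit a simple lexical retriever over the pretraining corpus chunks.
    vectorizer = TfidfVectorizer().fit(corpus_chunks + meta_chunks)
    corpus_vecs = vectorizer.transform(corpus_chunks)

    # 3. For each meta-chunk, retrieve hard negatives: texts similar enough to
    #    act as distractors, but drawn from unrelated corpus documents.
    extended = []
    for chunk in meta_chunks:
        sims = cosine_similarity(vectorizer.transform([chunk]), corpus_vecs)[0]
        top = sims.argsort()[::-1][:negatives_per_chunk]
        distractors = [corpus_chunks[i] for i in top]
        # 4. Interleave: distractors are placed before the dependent
        #    meta-chunk, so the model must discriminate the true long-range
        #    context from distracting content.
        extended.extend(distractors + [chunk])

    return "\n\n".join(extended)
```

The synthesized text can then be packed into long training sequences, so that the model sees genuine long-range dependencies separated by retrieved distractors rather than by randomly concatenated documents.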
Lay Summary: Most large language models struggle to understand information spread across very long documents, partly because high-quality training data of that length is rare. Existing methods try to create longer inputs by simply stitching together short texts, but this often leads to irrelevant or incoherent content. To solve this, we introduce NExtLong, a new way to train models on longer contexts. Instead of just concatenating texts, we deliberately insert carefully chosen "distracting" content between important parts of a document. This forces the model to learn how to identify and focus on meaningful information—even when it's buried in noise. Our method mimics the challenges real long documents present, helping models better handle long-range dependencies. In testing, NExtLong significantly outperformed previous approaches on multiple long-document tasks. This suggests it could help train more capable language models without needing massive amounts of real long-form data.
Link To Code: https://github.com/caskcsg/longcontext/tree/main/NExtLong
Primary Area: Deep Learning->Large Language Models
Keywords: Long-Context Model, Synthetic Data
Submission Number: 4876