Keywords: Language Model; Data Curation
Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, much of the readily available long-text data does not genuinely require extended context: most spans can be predicted from short-range context alone, and only a small fraction truly depends on long-distance dependencies. It is therefore important to identify and select training data with strong long-context dependencies. To this end, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER. Moreover, extensive analyses confirm that different types of text segments vary in their reliance on extended context, highlighting which data truly benefits from long-context modeling.
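To make the contrast described in the abstract concrete, below is a minimal sketch of scoring a document by the gain in predictability that extended context provides. It assumes a HuggingFace causal LM; the model name, window sizes, selection threshold, and helper names are illustrative placeholders rather than details from the paper.

```python
# Sketch: score documents by how much a long context improves prediction of a
# trailing target span, compared with a short context. Illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"        # assumption: any causal LM works
LONG_CTX, SHORT_CTX, TARGET_LEN = 8192, 512, 256  # illustrative window sizes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def nll_of_target(context_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Mean negative log-likelihood of target_ids conditioned on context_ids."""
    input_ids = torch.cat([context_ids, target_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0]
    # The logit predicting the first target token sits at the last context position.
    start = context_ids.shape[-1] - 1
    preds = logits[start : start + target_ids.shape[-1]]
    return F.cross_entropy(preds, target_ids).item()


@torch.no_grad()
def long_context_gain(text: str) -> float:
    """Information gain of long over short context for the trailing target span.

    Assumes the document is longer than LONG_CTX + TARGET_LEN tokens.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    target = ids[-TARGET_LEN:]
    long_ctx = ids[-(LONG_CTX + TARGET_LEN) : -TARGET_LEN]
    short_ctx = ids[-(SHORT_CTX + TARGET_LEN) : -TARGET_LEN]
    # Positive gain: the extended context makes the target easier to predict.
    return nll_of_target(short_ctx, target) - nll_of_target(long_ctx, target)


def select_long_dependency_docs(docs: list[str], threshold: float = 0.05) -> list[str]:
    """Keep documents whose trailing span genuinely benefits from long context."""
    return [d for d in docs if long_context_gain(d) > threshold]
```

In practice one would score many spans per document and batch the forward passes; this sketch evaluates a single trailing span per document purely to illustrate the long- versus short-context contrast.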
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16264