Keywords: long-context training; continued training; datasets
TL;DR: We find that long-distance referrals are crucial for long-context training, and we design a data pipeline to scale up the construction of such data.
Abstract: Training large language models for long-context understanding faces the challenge of data shortage.
Previous data engineering approaches mechanically concatenate short documents, which can produce many pseudo-long documents but raises concerns about data quality.
In this paper, we study the core attribute of high-quality data for long-context training and propose a data pipeline, LongPack, to scale up the construction of such data.
We find that long-distance referrals, which occur in naturally occurring long documents, are crucial for long-context training.
However, simply concatenating short documents does not reliably create these relations.
We further show that the density of long-distance referrals, which is higher in longer documents, plays a key role in training efficiency, making previous upsampling methods suboptimal.
To enrich the pool of long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones according to their referral relationships.
Specifically, for web pages, the primary source of language-model training data, we find that hyperlinks are a native signal of such relations.
By packing web pages along their hyperlink connections, we can create longer, high-quality documents.
Our experiments demonstrate that LongPack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset while using just 0.5% of the documents as roots.
Furthermore, the constructed documents are of 'near-natural' quality, serving as well as innately long documents for long-context training, and achieve a 32.7% higher score than previous state-of-the-art methods.
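
To make the hyperlink-based packing idea concrete, below is a minimal, hypothetical sketch, not the paper's implementation: it traverses pages reachable from a root page via hyperlinks breadth-first and concatenates their text until a target length budget is reached. The function name pack_by_hyperlinks, the pages/links inputs, and the target_tokens parameter are illustrative assumptions.

# Minimal illustrative sketch (assumptions, not the paper's code): pack short
# web pages into one longer document by following hyperlink connections from a
# root page, stopping once a target length is reached.
from collections import deque

def pack_by_hyperlinks(root_id, pages, links, target_tokens=32_000):
    """Breadth-first packing of pages reachable from `root_id`.

    pages: dict mapping page id -> page text
    links: dict mapping page id -> list of hyperlinked page ids
    """
    packed, seen = [], {root_id}
    length = 0
    queue = deque([root_id])
    while queue and length < target_tokens:
        pid = queue.popleft()
        text = pages.get(pid, "")
        packed.append(text)
        length += len(text.split())  # crude whitespace token count
        for nbr in links.get(pid, []):
            if nbr not in seen and nbr in pages:
                seen.add(nbr)
                queue.append(nbr)
    return "\n\n".join(packed)

# Example usage with toy data:
pages = {"a": "Root page text ...", "b": "Linked page one ...", "c": "Linked page two ..."}
links = {"a": ["b", "c"], "b": [], "c": []}
long_doc = pack_by_hyperlinks("a", pages, links, target_tokens=100)
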
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9089