Abstract: In recent years, foundation models, particularly large language models (LLMs), have demonstrated significant improvements across a variety of tasks. One of their most important features is long-context capability, which enables them to generate extended text with high semantic coherence, retrieve relevant information, and handle tasks involving substantial amounts of text efficiently. The key to improving long-context performance lies in effective data organization and management strategies that integrate data from multiple domains and optimize the context window during training. Through extensive experimental analysis, we identify three key challenges in designing data management strategies that enable a model to achieve long-context capability without sacrificing performance on other tasks: (1) a shortage of long documents across multiple domains, (2) effective construction of context windows, and (3) efficient organization of large-scale datasets. To address these challenges, we introduce DataSculpt, a novel data management framework designed for long-context training. We first formulate the organization of training data as a multi-objective optimization problem, focusing on attributes including the relevance among documents within the same training sequence, the quantity of concatenated instances, individual document integrity, and computational cost. Specifically, our approach uses a coarse-to-fine method to optimize training data organization. We begin by clustering the data based on semantic similarity (coarse), followed by a multi-objective greedy search within each cluster to score and concatenate documents into context windows (fine). We have deployed DataSculpt as the data management backend for long-context training at Baichuan Inc. Extensive experiments with diverse downstream tasks show that DataSculpt enhances the model's long-context performance by an average of 15.73%, while maintaining general capabilities with a 4.63% improvement.
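The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. This is not DataSculpt's implementation: the abstract does not give the scoring function, so the score used here (cosine relevance minus a penalty for unused window space), the weights `alpha` and `beta`, the nearest-centroid clustering step, and all function and field names (`coarse_cluster`, `greedy_pack`, `organize`, `emb`, `len`) are assumptions made for illustration only. Computational cost, one of the stated objectives, is not modeled.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_cluster(docs, centroids):
    """Coarse stage (assumed): assign each document to its most similar centroid."""
    clusters = defaultdict(list)
    for doc in docs:
        nearest = max(range(len(centroids)),
                      key=lambda i: cosine(doc["emb"], centroids[i]))
        clusters[nearest].append(doc)
    return clusters


def greedy_pack(cluster, window, alpha=1.0, beta=1.0):
    """Fine stage (assumed): greedily place documents into context windows."""
    windows = []
    # Longest documents first, so they are placed before space fragments.
    for doc in sorted(cluster, key=lambda d: d["len"], reverse=True):
        kept = min(doc["len"], window)  # truncate only if longer than a window
        best, best_score = None, -math.inf
        for w in windows:
            free = window - w["used"]
            if kept > free:
                continue  # keep documents whole: never split across windows
            relevance = cosine(doc["emb"], w["emb"])
            leftover = (free - kept) / window
            # Hypothetical score: reward relevance, penalize unused space.
            score = alpha * relevance - beta * leftover
            if score > best_score:
                best, best_score = w, score
        if best is None:
            # No existing window fits: open a new one seeded by this document.
            best = {"ids": [], "used": 0, "emb": doc["emb"]}
            windows.append(best)
        best["ids"].append(doc["id"])
        best["used"] += kept
    return windows


def organize(docs, centroids, window):
    """Cluster first, then pack each cluster; returns lists of document ids."""
    clusters = coarse_cluster(docs, centroids)
    return [w["ids"]
            for key in sorted(clusters)
            for w in greedy_pack(clusters[key], window)]
```

Each document is assumed to be a dict with an `id`, a token count `len`, and a precomputed embedding `emb`. Calling `organize(docs, centroids, window)` returns one list of document ids per context window, with semantically similar documents grouped together and no document split across windows.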
External IDs: dblp:conf/icde/LuNLPZZCZDCZ25