Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Data Selection, Data Quality, Data Diversity, Pre-training
Abstract: High-quality pre-training data is a decisive factor for large language models: quality captures factual reliability and semantic value, while diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, empirical studies have shown that directly selecting the top-scored data often degrades downstream performance, and sampling from a broader score range is required to recover results. This non-monotonic relationship between data scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, so top-scored data appear high-quality while diversity is systematically overlooked. We argue that ensuring diversity requires decomposing correlated evaluation metrics into orthogonal feature dimensions, from which top-scored data can then be selected directly. To this end, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during high-quality data selection. First, ODiS evaluates data along multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The resulting multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto the PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting the top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming the orthogonality between dimensions.
More importantly, models trained on ODiS-selected data significantly outperform baselines on multiple downstream benchmarks, underscoring the necessity of orthogonal, diversity-aware data selection for LLMs.
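The selection pipeline described in the abstract (correlated multi-dimensional scores, PCA decorrelation, then top-scored selection within each orthogonal dimension) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the corpus size, score dimensions, correlation structure, and selection budget below are hypothetical stand-ins, and the paper's RoBERTa-based scorers are replaced by synthetic scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_dims, k = 10_000, 3, 500  # hypothetical corpus size and budget

# Synthetic correlated scores standing in for the three raw evaluation
# dimensions (language quality, knowledge quality, comprehension
# difficulty): each shares a common latent factor plus noise.
base = rng.normal(size=(n_docs, 1))
scores = np.hstack(
    [base + 0.3 * rng.normal(size=(n_docs, 1)) for _ in range(n_dims)]
)

# Decorrelate via PCA: center the scores, then project onto the
# eigenvectors of their covariance matrix, sorted by explained variance.
centered = scores - scores.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
projected = centered @ eigvecs[:, order]  # orthogonal evaluation dimensions

# Select the top-k documents independently along each orthogonal
# dimension; the union forms the training set.
selected = [
    set(np.argsort(projected[:, d])[::-1][:k]) for d in range(n_dims)
]

# Measure pairwise overlap between the per-dimension selections.
pairs = [(a, b) for a in range(n_dims) for b in range(a + 1, n_dims)]
overlap = max(len(selected[a] & selected[b]) / k for a, b in pairs)
print(f"max pairwise overlap: {overlap:.1%}")
```

Because the projected dimensions are uncorrelated, the per-dimension top-k sets intersect far less than top-k sets drawn from the raw correlated scores would, which is the mechanism behind the low inter-dimension overlap the abstract reports.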
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11233