Keywords: data-efficient training, data selection
TL;DR: We propose a new method named PS, consisting of a Prune step and a Select step, which selects a high-quality, important, and diverse subset by efficiently utilizing the training trajectories of data samples collected from a small proxy model.
Abstract: The rapid expansion of instruction datasets not only escalates the computational cost of instruction fine-tuning but also introduces data-related challenges, such as the presence of noisy or low-quality samples and the redundancy caused by duplicate or highly similar instances. To address these issues, data selection methods have been proposed to reduce training expenses while preserving, or even enhancing, model performance through fine-tuning on an appropriately chosen subset. In this paper, we propose a new method named $\textbf{PS}$, consisting of a $\underline{\textbf{P}}$rune step and a $\underline{\textbf{S}}$elect step, which selects a high-quality, important, and diverse subset by efficiently utilizing the training trajectories of data samples collected from a small proxy model. Specifically, in the $\underline{\textbf{P}}$rune step, we prune low-quality data that do not exhibit a downward trend in their $\textbf{loss trajectories}$, as such samples may negatively impact model training. In the $\underline{\textbf{S}}$elect step, we introduce the concept of $\textbf{the learning trajectory}$ (i.e., the loss reduction trajectory or the loss reduction rate trajectory), which better represents the model's learning progress on each data sample, and use these $\textbf{learning trajectories}$ as sample features to cluster the samples retained from the $\underline{\textbf{P}}$rune step. A balanced selection is then performed across all clusters within a fixed budget. We validate $\textbf{PS}$ on the MathInstruct dataset (262K) with the open-source model suite Pythia, comparing it against two categories of data selection methods: importance-based and diversity-based methods. Experimental results show that $\textbf{PS}$ consistently outperforms all baseline methods across budget constraints of 30K (11.5\%), 50K (19.1\%), and 100K (38.2\%). Notably, $\textbf{PS}$ surpasses the model trained on the full dataset while using less than 40\% of the data.
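For concreteness, below is a minimal Python sketch of the two steps outlined in the abstract, assuming per-sample loss trajectories recorded at several proxy-model checkpoints. The slope-based downward-trend test, the use of KMeans, and the equal per-cluster quota are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a Prune-then-Select pipeline (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans


def prune(loss_trajectories: np.ndarray) -> np.ndarray:
    """Prune step: keep samples whose loss trajectory trends downward.

    loss_trajectories: array of shape (n_samples, n_checkpoints) with losses
    recorded on a small proxy model. Returns indices of retained samples.
    """
    steps = np.arange(loss_trajectories.shape[1])
    # Fit a per-sample linear trend; a negative slope indicates a downward trend.
    slopes = np.polyfit(steps, loss_trajectories.T, deg=1)[0]
    return np.where(slopes < 0)[0]


def select(loss_trajectories: np.ndarray, kept: np.ndarray,
           budget: int, n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """Select step: cluster learning trajectories, then pick a balanced subset."""
    # Learning trajectory = loss reduction trajectory (consecutive loss drops).
    learning_traj = -np.diff(loss_trajectories[kept], axis=1)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(learning_traj)

    # Balanced selection: draw roughly the same number of samples per cluster.
    per_cluster = budget // n_clusters
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        members = kept[labels == c]
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(selected)
```

In this sketch, any budget left over by clusters smaller than the quota is simply unused; a fuller implementation could redistribute it across the remaining clusters.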
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17221