Keywords: data-efficient training, data selection
TL;DR: We propose a new method named PS, consisting of a Prune step and a Select step, which selects a high-quality, important, and diverse subset by efficiently utilizing the training trajectories of data samples collected from a small proxy model.
Abstract: The rapid expansion of instruction datasets not only escalates the computational cost of instruction fine-tuning but also introduces data-related challenges, such as the presence of noisy or low-quality samples and the redundancy caused by duplicate or highly similar instances. To address these issues, data selection methods have been proposed to reduce training expenses while preserving, or even enhancing, model performance through fine-tuning on an appropriately chosen subset. In this paper, we propose a new method named $\textbf{PS}$, consisting of a $\underline{\textbf{P}}$rune step and a $\underline{\textbf{S}}$elect step, which selects a high-quality, important, and diverse subset by efficiently utilizing the training trajectories of data samples collected from a small proxy model. Specifically, in the $\underline{\textbf{P}}$rune step, we prune low-quality data that do not exhibit a downward trend in their $\textbf{loss trajectories}$, as such samples may negatively impact model training. In the $\underline{\textbf{S}}$elect step, we introduce the concept of $\textbf{the learning trajectory}$ (i.e., the loss reduction trajectory or the loss reduction rate trajectory), which better represents the model's learning progress on each data sample, and use these $\textbf{learning trajectories}$ as sample features to cluster the samples retained from the $\underline{\textbf{P}}$rune step. A balanced selection is then performed across all clusters within a fixed budget. We validate $\textbf{PS}$ on the MathInstruct dataset (262K) with the open-source model suite Pythia, comparing it against two categories of data selection methods: importance-based and diversity-based methods. Experimental results show that $\textbf{PS}$ consistently outperforms all baseline methods across budget constraints of 30K (11.5\%), 50K (19.1\%), and 100K (38.2\%). Notably, $\textbf{PS}$ surpasses the model trained on the full dataset while using less than 40\% of the data.
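For concreteness, below is a minimal Python sketch of the two steps outlined in the abstract, assuming per-sample loss trajectories recorded at several proxy-model checkpoints. The slope-based downward-trend test, the use of KMeans, and the equal per-cluster quota are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a Prune-then-Select pipeline (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans


def prune(loss_trajectories: np.ndarray) -> np.ndarray:
    """Prune step: keep samples whose loss trajectory trends downward.

    loss_trajectories: array of shape (n_samples, n_checkpoints) with losses
    recorded on a small proxy model. Returns indices of retained samples.
    """
    steps = np.arange(loss_trajectories.shape[1])
    # Fit a per-sample linear trend; a negative slope indicates a downward trend.
    slopes = np.polyfit(steps, loss_trajectories.T, deg=1)[0]
    return np.where(slopes < 0)[0]


def select(loss_trajectories: np.ndarray, kept: np.ndarray,
           budget: int, n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """Select step: cluster learning trajectories, then pick a balanced subset."""
    # Learning trajectory = loss reduction trajectory (consecutive loss drops).
    learning_traj = -np.diff(loss_trajectories[kept], axis=1)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(learning_traj)

    # Balanced selection: draw roughly the same number of samples per cluster.
    per_cluster = budget // n_clusters
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        members = kept[labels == c]
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(selected)
```

In this sketch, any budget left over by clusters smaller than the quota is simply unused; a fuller implementation could redistribute it across the remaining clusters.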
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17221