Abstract: Selecting key data subsets for model training is an effective way to improve training efficiency. Existing methods generally use a well-trained model to evaluate samples and select crucial subsets, ignoring the fact that sample importance changes dynamically during model training; as a result, the selected subset is critical only at a specific training epoch rather than across a changing training phase. To address this issue, we evaluate the significant changes in sample importance during dynamic training and propose a novel data selection method to improve model training efficiency. Specifically, the temporal changes in sample importance are considered from three perspectives: (i) loss, the difference between the predicted labels and the true labels of samples in the current training epoch; (ii) instability, the dispersion of sample importance in the recent training phase; and (iii) inconsistency, the comparison of the changing trend in the importance of an individual sample relative to the average importance of all samples in the recent training phase. Extensive experiments demonstrate that dynamic data selection can reduce computational costs and improve model training efficiency. Additionally, we find that the difficulty level of the training task influences the data selection strategy.
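The three perspectives above can be illustrated with a minimal sketch. The function below is hypothetical (not the authors' implementation): it assumes per-sample losses have been recorded over recent epochs in a `loss_history` array, and computes a loss, instability, and inconsistency signal per sample.

```python
import numpy as np

def dynamic_importance(loss_history, window=5):
    """Hypothetical sketch of the three per-sample signals.

    loss_history: array of shape (epochs, n_samples) holding per-sample
    losses recorded over training epochs (assumed bookkeeping).
    """
    recent = loss_history[-window:]       # the "recent training phase"
    loss = recent[-1]                     # (i) current-epoch loss
    instability = recent.std(axis=0)      # (ii) dispersion over the phase
    # (iii) inconsistency: each sample's importance trend compared with
    # the average trend across all samples in the phase
    sample_trend = recent[-1] - recent[0]
    avg_trend = sample_trend.mean()
    inconsistency = np.abs(sample_trend - avg_trend)
    return loss, instability, inconsistency
```

A selection rule could then rank samples by some combination of these three signals each phase, rather than by a single static score.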