Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Simon Yu; Liangyu Chen; Sara Ahmadian; Marzieh Fadaee

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

26 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: diversity, data selection, training efficiency, iterative refinement

TL;DR: We demonstrate that k-means clustering is still effective for selecting an diverse subset of instructional data; and we propose an iterative clustering method to further boost the performance.

Abstract: Finetuning large language models on instruction data is an important step in enriching the knowledge learned during pre-training and improving instruction-following capabilities. As the number of instruction datasets continues to grow, selecting the right data to achieve optimal results becomes increasingly important. In this work, we ask a prominent question: How can we determine the optimal subset of data for effective training? While much of the existing research primarily emphasizes local criteria, such as instance quality, for subset selection, we argue that a global approach focused on data diversity is more critical. Our approach utilizes $k$-means clustering to ensure that the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, with the importance and sampling weight of each cluster being reassessed in every training iteration. This method allows us to reduce the effect of outliers and automatically filter out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7\% increase over the random selection and a 3.8\% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is submitted as supplementary materials.

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 8133

Loading