Keywords: data selection, coreset, gradients, instruction tuning, large language model
TL;DR: We design a novel coreset selection method that optimizes instruction tuning by considering both data distribution coverage and batch diversity.
Abstract: Instruction tuning has improved the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training. The challenge lies in developing specific capabilities by identifying useful data and fine-tuning on it efficiently. A high-quality, diverse pruned dataset can help models achieve lossless performance at a lower cost. In this paper, we propose OptBatch, a novel data selection method that focuses on the learnability of a whole batch of data rather than of individual samples. OptBatch covers the data distribution through stratified sampling and maximizes the relative distance between samples within a batch to enhance diversity. Furthermore, OptBatch uses Hessian gradient optimization to guide the selection strategy for subsequent batches. OptBatch effectively captures the intrinsic value of data curation, surpasses previous state-of-the-art methods, and generalizes robustly across diverse downstream tasks and models. Extensive experiments show that training with OptBatch at various pruning rates outperforms full-dataset training while reducing computational cost by 20-40%. Additionally, evaluations using GPT-4 scores and other metrics on multi-turn dialogue, multilingual translation, and QA tasks consistently show OptBatch performing best.
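The two selection ideas named in the abstract, stratified sampling for coverage and maximizing within-batch distance for diversity, can be illustrated with a minimal sketch. This is not the OptBatch implementation: the function names, the use of precomputed stratum labels, the Euclidean distance, and the greedy farthest-point rule are all assumptions made for illustration, and the Hessian-guided step is omitted entirely.

```python
import numpy as np

def stratified_candidates(strata, per_stratum, rng):
    """Draw up to `per_stratum` indices from every stratum (hypothetical helper)
    so that each region of the data distribution is represented."""
    picked = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        k = min(per_stratum, members.size)
        picked.extend(rng.choice(members, size=k, replace=False))
    return np.array(picked)

def greedy_diverse_batch(features, candidates, batch_size):
    """Greedy farthest-point selection (an assumed stand-in for the paper's
    diversity objective): repeatedly add the candidate whose minimum
    distance to the already chosen samples is largest."""
    cand = features[candidates]
    chosen = [0]  # arbitrary starting point: the first candidate
    min_dist = np.linalg.norm(cand - cand[0], axis=1)
    min_dist[0] = -np.inf  # never pick the same candidate twice
    while len(chosen) < min(batch_size, len(candidates)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Update each candidate's distance to its nearest chosen sample.
        min_dist = np.minimum(min_dist, np.linalg.norm(cand - cand[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return candidates[np.array(chosen)]

# Toy data: 200 random feature vectors assigned to 4 assumed strata.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
strata = rng.integers(0, 4, size=200)

candidates = stratified_candidates(strata, per_stratum=10, rng=rng)
batch = greedy_diverse_batch(features, candidates, batch_size=16)
```

In this sketch the stratified step fixes coverage first and the greedy step then spreads the batch out within the candidate pool; a real pipeline would derive the features and strata from model embeddings or gradients rather than random data.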
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13511