Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

ICLR 2025 Conference Submission 12747 Authors

28 Sept 2024 (modified: 22 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Instruction Tuning, Data Selection, Influence Estimation
TL;DR: We propose BIDS, an improved influence-based data selection algorithm for LLM instruction tuning, to promote balanced learning of diverse capabilities.
Abstract: Selecting appropriate training data is crucial for successful instruction fine-tuning, which aims to (1) elicit strong capabilities from pretrained large language models (LLMs), and (2) achieve balanced performance across a diverse range of tasks. Algorithms based on influence estimation have shown promise in achieving (1) by estimating the contribution of each training example to the model's predictions on a downstream task, but they often struggle with (2). Through systematic experiments, we attribute their underperformance to an inherent bias: certain tasks intrinsically have greater influence than others. Directly comparing influence scores across different tasks would thus bias the selected data towards these tasks, hurting the LLM's performance not only on other capabilities but also, surprisingly, on the very tasks for which the selected data has high influence. To address this issue, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data with respect to each downstream task at an instance level. It then applies an iterative process to further balance the selection of influential training data: at each step, BIDS selects the training example that bears the highest influence on the capability most underrepresented by the currently selected data. We perform comprehensive experiments using both Llama-3 and Mistral-v0.3 on seven evaluation benchmarks spanning five diverse capabilities. Results demonstrate that BIDS consistently outperforms both state-of-the-art influence-based data selection algorithms and other non-influence-based selection frameworks under various budgets. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training, with a much more balanced performance across different tasks. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
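The two-stage procedure described in the abstract (per-task normalization followed by greedy, balance-aware selection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `bids_select`, the use of z-score normalization per task, and the "accumulated influence" measure of underrepresentation are all assumptions made for the sketch.

```python
import numpy as np

def bids_select(influence, budget):
    """Hedged sketch of a BIDS-style selection loop.

    influence: (n_train, n_tasks) array of raw influence scores of each
               training example on each downstream task.
    budget:    number of training examples to select.
    """
    # Stage 1 (assumed form): normalize each task's influence column so
    # scores are comparable across tasks, removing the per-task bias.
    mu = influence.mean(axis=0, keepdims=True)
    sigma = influence.std(axis=0, keepdims=True) + 1e-8
    norm = (influence - mu) / sigma

    n_train, n_tasks = norm.shape
    selected = []
    acquired = np.zeros(n_tasks)            # influence accumulated per task
    available = np.ones(n_train, dtype=bool)

    # Stage 2: iteratively pick the example most influential for the
    # capability currently most underrepresented by the selected set.
    for _ in range(budget):
        t = int(np.argmin(acquired))        # most underrepresented task
        cand = np.where(available)[0]
        i = cand[np.argmax(norm[cand, t])]  # best available example for t
        selected.append(int(i))
        available[i] = False
        acquired += norm[i]                 # update coverage on all tasks
    return selected
```

Under this reading, the greedy loop trades a small amount of per-task influence for balance: an example is never chosen solely because one task's influence scale dominates the others.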
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12747