Rethinking Data Selection: The Importance of Coverage over Difficulty in Generative Fine-Tuning

ICLR 2026 Conference Submission 20781 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Data selection, data efficiency
Abstract: Selecting high-quality training data can reduce the computational cost of LLM fine-tuning. Prior data selection methods have developed a variety of scores intended to reflect what information a data instance can provide to the model, in order to subselect instances for fine-tuning, and a majority of this prior work has focused on scores quantifying difficulty. The intuition in such work is that more difficult examples are more informative and can therefore lead to more efficient fine-tuning. While difficulty-based data selection has shown promise for smaller classification models, in this work we find that such scores are ineffective for fine-tuning LLMs on generative tasks because their narrow focus on "difficult" instances fails to capture the necessary diversity of the input data. We find that on generative tasks, such approaches consistently fall behind random selection, which our analysis reveals is more representative of the underlying input space, i.e., has better coverage. Motivated by this, we propose a simple clustering-based selection method that selects data more representative of the underlying input distribution, enabling selection of smaller subsets of training data for generative tasks. In a case study on Llama-3-8B (Grattafiori et al., 2024) and OLMO2-7B (Walsh et al., 2025), we find that the coverage-based approach performs well above difficulty scoring, yielding performance at or above that of random selection across a set of generative tasks.
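To make the coverage idea concrete, the following is a minimal sketch of clustering-based selection: embed each training instance, cluster the embeddings into as many groups as the selection budget, and keep the instance nearest each cluster centroid. The use of k-means and the closest-to-centroid pick are illustrative assumptions, not necessarily the authors' exact method, and the function name is hypothetical.

    # Hypothetical coverage-based selection sketch. Assumes precomputed
    # instance embeddings; k-means and the nearest-to-centroid pick are
    # illustrative choices, not the paper's confirmed implementation.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_by_coverage(embeddings: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
        """Return indices of `budget` instances that span the input space."""
        # Partition the embedding space into `budget` clusters.
        km = KMeans(n_clusters=budget, random_state=seed, n_init="auto").fit(embeddings)
        selected = []
        for c in range(budget):
            # Keep the member closest to each centroid as that region's representative.
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
            selected.append(members[np.argmin(dists)])
        return np.array(selected)

    # Usage: indices = select_by_coverage(X, budget=1000); fine-tune on data[indices].

Unlike difficulty scoring, which concentrates the selected subset in hard regions of the input space, picking one representative per cluster spreads the budget across the whole input distribution.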
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 20781