ACCELERATE SCALING OF LLM FINETUNING VIA QUANTIFYING THE COVERAGE AND DEPTH OF INSTRUCTION SET

18 Sept 2025 (modified: 09 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: instruction set refinement, instruction data selection
TL;DR: We show that semantic coverage and information depth are key to SFT scalability, and propose ILA, a simple model-agnostic data selection method that enables accelerated scaling with compact, high-value subsets.
Abstract: Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: \textbf{semantic coverage}, the breadth of task domains, and \textbf{information depth}, the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. We further propose the \textbf{Information Landscape Approximation (ILA)}, a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term \textbf{accelerated scaling}.
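The abstract does not specify how ILA scores or combines coverage and depth, so the following is only a minimal illustrative sketch of a joint coverage-plus-depth selection rule, not the paper's actual algorithm. It assumes "coverage" is approximated by farthest-point distances in an instruction embedding space and "depth" by a precomputed per-example information score in [0, 1], traded off by a hypothetical weight `alpha`.

```python
# Hypothetical sketch of a coverage + depth selection objective (NOT the paper's
# actual ILA procedure, whose objective is not given in the abstract).
# Assumptions: coverage ~ max-min distance to the already-selected set;
# depth ~ a precomputed per-example proxy (e.g., response length or entropy).
import numpy as np

def select_subset(embeddings: np.ndarray,
                  depth_scores: np.ndarray,
                  k: int,
                  alpha: float = 0.5) -> list[int]:
    """Greedily pick k examples balancing semantic coverage and information depth.

    embeddings   : (n, d) array of normalized instruction embeddings
    depth_scores : (n,) array of per-example depth proxies, scaled to [0, 1]
    alpha        : assumed hyperparameter weighting coverage vs. depth
    """
    n = embeddings.shape[0]
    selected: list[int] = []
    # Distance from each point to its closest already-selected point;
    # large values mean the point would add new semantic coverage.
    min_dist = np.full(n, np.inf)

    for _ in range(k):
        if not selected:
            coverage_gain = np.ones(n)  # first pick: coverage gain is uniform
        else:
            d_max = min_dist[np.isfinite(min_dist)].max() + 1e-12
            coverage_gain = np.where(np.isfinite(min_dist), min_dist / d_max, 1.0)
        utility = alpha * coverage_gain + (1 - alpha) * depth_scores
        utility[selected] = -np.inf  # never re-pick an example
        idx = int(np.argmax(utility))
        selected.append(idx)
        # Update nearest-selected distances with the newly chosen point.
        dist_to_new = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy usage with random stand-ins for real embeddings and depth proxies.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 64))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    depth = rng.uniform(size=1000)
    subset = select_subset(emb, depth, k=50, alpha=0.6)
    print(f"selected {len(subset)} examples, first five: {subset[:5]}")
```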
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11197