SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Keywords: data selection
Abstract: Despite the effectiveness of data selection for pretraining and instruction fine-tuning
large language models (LLMs), improving data efficiency in supervised fine-tuning
(SFT) for specialized domains poses significant challenges due to the complexity
of fine-tuning data. To bridge this gap, we introduce an effective and scalable
data selection method for SFT, SmallToLarge (S2L), which first trains a small
model, then clusters the loss trajectories of the training examples, and finally samples
from these clusters to guide data selection for larger models. We prove that during fine-tuning, samples
within the same loss trajectory cluster exhibit similar gradients. Then, we show
that the subsets selected by S2L have a bounded gradient error with respect to the full data
and hence guarantee convergence to a neighborhood of the optimal solution. We demonstrate through
extensive experiments that S2L significantly improves data efficiency in SFT for
mathematical problem-solving, reducing the training data requirement to just $11$%
of the original MathInstruct dataset to match the performance of training on the full dataset,
while outperforming state-of-the-art data selection algorithms by an average of $4.7$%
across $6$ in-domain and out-of-domain evaluation datasets. Remarkably, with only 50K
examples selected for SFT, S2L achieves $32.7$% accuracy on the challenging MATH
benchmark, improving Phi-2 by $16.6$%. In clinical text summarization on the
MIMIC-III dataset, S2L again outperforms training on the full dataset using
only $50$% of the data. Notably, S2L can perform scalable data selection using a
reference model $100\times$ smaller than the target model, proportionally reducing the
computational cost.
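The pipeline described in the abstract (record per-example loss trajectories while training a small reference model, cluster the trajectories, then sample from the clusters to form the SFT subset) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `s2l_select`, the use of k-means, and the balanced per-cluster sampling are choices made here for concreteness; the paper specifies the exact clustering and sampling scheme.

```python
import numpy as np
from sklearn.cluster import KMeans


def s2l_select(loss_trajectories, budget, n_clusters=100, seed=0):
    """Illustrative S2L-style selection (hypothetical helper, not the official code).

    loss_trajectories: array of shape (n_examples, n_checkpoints); entry [i, t] is
        the loss of example i recorded at the t-th checkpoint of the small model.
    budget: total number of examples to select for fine-tuning the large model.
    Returns the indices of the selected examples.
    """
    rng = np.random.default_rng(seed)
    n = loss_trajectories.shape[0]

    # Group examples whose losses evolve similarly during small-model training.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        loss_trajectories
    )

    # Sample roughly the same number of examples from every cluster so that
    # small but distinct clusters are still represented in the subset.
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        if take > 0:
            selected.append(rng.choice(members, size=take, replace=False))
    selected = np.concatenate(selected) if selected else np.array([], dtype=int)

    # If rounding or small clusters left the budget unfilled, top up at random.
    if len(selected) < budget:
        remaining = np.setdiff1d(np.arange(n), selected)
        extra = rng.choice(remaining, size=budget - len(selected), replace=False)
        selected = np.concatenate([selected, extra])
    return selected
```

Because the trajectories come from a small reference model, the selected subset can then be used to fine-tune a much larger target model; this separation is what allows the selection cost to shrink with the size of the reference model, as stated in the abstract.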
Primary Area: Optimization for deep networks
Submission Number: 12338