SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Keywords: data selection
Abstract: Despite the effectiveness of data selection for pretraining and instruction fine-tuning
large language models (LLMs), improving data efficiency in supervised fine-tuning
(SFT) for specialized domains poses significant challenges due to the complexity
of fine-tuning data. To bridge this gap, we introduce an effective and scalable
data selection method for SFT, SmallToLarge (S2L), which first trains a small
model, then clusters the loss trajectories of the training examples, and finally samples
from these clusters to guide data selection for larger models. We prove that during fine-tuning, samples
within the same loss trajectory cluster exhibit similar gradients. Then, we show
that the subsets selected by S2L have a bounded gradient error with respect to the full data
and hence guarantee convergence to a neighborhood of the optimal solution. We demonstrate through
extensive experiments that S2L significantly improves data efficiency in SFT for
mathematical problem-solving, reducing the training data requirement to just $11$%
of the original MathInstruct dataset to match the performance of training on the full dataset,
while outperforming state-of-the-art data selection algorithms by an average of $4.7$%
across $6$ in-domain and out-of-domain evaluation datasets. Remarkably, with only 50K
examples selected for SFT, S2L achieves $32.7$% accuracy on the challenging MATH
benchmark, improving Phi-2 by $16.6$%. In clinical text summarization on the
MIMIC-III dataset, S2L again outperforms training on the full dataset using
only $50$% of the data. Notably, S2L can perform scalable data selection using a
reference model $100\times$ smaller than the target model, proportionally reducing the
computational cost.
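The pipeline described in the abstract (record per-example loss trajectories while training a small reference model, cluster the trajectories, then sample from the clusters to form the SFT subset) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `s2l_select`, the use of k-means, and the balanced per-cluster sampling are choices made here for concreteness; the paper specifies the exact clustering and sampling scheme.

```python
import numpy as np
from sklearn.cluster import KMeans


def s2l_select(loss_trajectories, budget, n_clusters=100, seed=0):
    """Illustrative S2L-style selection (hypothetical helper, not the official code).

    loss_trajectories: array of shape (n_examples, n_checkpoints); entry [i, t] is
        the loss of example i recorded at the t-th checkpoint of the small model.
    budget: total number of examples to select for fine-tuning the large model.
    Returns the indices of the selected examples.
    """
    rng = np.random.default_rng(seed)
    n = loss_trajectories.shape[0]

    # Group examples whose losses evolve similarly during small-model training.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        loss_trajectories
    )

    # Sample roughly the same number of examples from every cluster so that
    # small but distinct clusters are still represented in the subset.
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        if take > 0:
            selected.append(rng.choice(members, size=take, replace=False))
    selected = np.concatenate(selected) if selected else np.array([], dtype=int)

    # If rounding or small clusters left the budget unfilled, top up at random.
    if len(selected) < budget:
        remaining = np.setdiff1d(np.arange(n), selected)
        extra = rng.choice(remaining, size=budget - len(selected), replace=False)
        selected = np.concatenate([selected, extra])
    return selected
```

Because the trajectories come from a small reference model, the selected subset can then be used to fine-tune a much larger target model; this separation is what allows the selection cost to shrink with the size of the reference model, as stated in the abstract.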
Primary Area: Optimization for deep networks
Submission Number: 12338