Abstract: Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
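To make the abstract's pipeline concrete, here is a minimal toy sketch of the idea of fitting a parametric per-domain loss and then optimizing the mixture weights. The specific functional form below (a power law in "effective data", with hypothetical transfer coefficients `T[i, j]`), the parameter values, and the two-domain line search are all illustrative assumptions, not the paper's actual parametrization or fitting procedure.

```python
import numpy as np

# Assumed (illustrative) loss model: each domain i's validation loss follows
#   L_i(w) = E_i + A_i * n_eff_i(w) ** (-alpha_i),
# where n_eff_i(w) = N * sum_j T[i, j] * w_j is the effective data received
# by domain i, and T[i, j] is a hypothetical transfer rate from domain j's
# data to domain i (T[i, i] = 1). In practice E, A, alpha, and T would be
# fit from small-scale mixture runs rather than set by hand.

def predicted_loss(w, N, T, E, A, alpha):
    """Per-domain predicted validation loss under the assumed power law."""
    n_eff = N * (T @ w)
    return E + A * n_eff ** (-alpha)

def best_mixture(N, T, E, A, alpha, grid=101):
    """Minimize mean predicted loss over 2-domain mixtures by line search."""
    ws = np.linspace(0.01, 0.99, grid)
    losses = [predicted_loss(np.array([w, 1.0 - w]), N, T, E, A, alpha).mean()
              for w in ws]
    k = int(np.argmin(losses))
    return np.array([ws[k], 1.0 - ws[k]])

# Hypothetical fitted parameters for two domains (e.g. math, code).
T = np.array([[1.0, 0.3],   # code data "transfers" to math at rate 0.3
              [0.2, 1.0]])  # math data "transfers" to code at rate 0.2
E = np.array([1.5, 1.2])    # irreducible per-domain losses
A = np.array([4.0, 3.0])    # power-law amplitudes
alpha = np.array([0.3, 0.3])

w_star = best_mixture(N=10_000, T=T, E=E, A=A, alpha=alpha)
print("optimal weights:", w_star)
```

The sketch replaces the paper's grid search over real training runs with a cheap search over the fitted surrogate, which is the source of the claimed speedup.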
Lay Summary: Training large language models to handle various tasks like solving math problems or writing code requires combining different types of data. But figuring out the right mix of data is difficult—too much of one type might make the model good at math but bad at coding, and testing every possible mix is time-consuming.
We developed a method that automatically finds the best data mix. It uses small experiments to predict how adding more of one type of data (e.g., coding examples) helps the model learn other tasks (e.g., math) indirectly.
Our method works as well as manually testing many data combinations but is much faster. It ensures the model performs well across all tasks without favoring one over another. For example, it helped build a medical chatbot that learned better by mixing general instructions with medical data instead of using only medical examples. This makes language model training smarter and more reliable for real-world use.
Primary Area: Deep Learning->Large Language Models
Keywords: Large language models, supervised fine-tuning, data mixing
Submission Number: 13907