Keywords: Mathematical Reasoning
TL;DR: Synthetic instruction tuning data for math reasoning
Abstract: Mathematical reasoning continues to be a critical challenge in large language model (LLM) development, with a significant performance gap between closed-source and open-source efforts, largely due to differences in training data quality and scale. The emergence of frontier open-weight LLMs offers new opportunities to generate high-quality, commercially permissive synthetic data to help bridge this gap. In this paper, we investigate the recently released Llama-3.1 family of models to improve open-source math reasoning through synthetically generated supervised finetuning (SFT) data. We conduct ablation studies to optimize design choices for the dataset, such as solution format and teacher model selection, which enhance SFT performance. We also investigate SFT's robustness to incorrect solutions and find that at large data scales, the model can be robust to as much as 20% noise, suggesting that the simple answer-matching heuristic is sufficient for SFT data selection. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (> 600K unique questions), making it nearly eight times larger than any previous such dataset. Finetuning Llama-3.1-8B-Base on OpenMathInstruct-2 outperforms Llama-3.1-8B-Instruct on MATH by an absolute 14.6% (51.9 → 66.5), demonstrating the effectiveness of the dataset. As part of our open-source efforts, we will release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
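To make the data-selection step concrete, the sketch below illustrates the kind of answer-matching filter the abstract refers to. This is a minimal illustration, not the authors' implementation: the function names (`extract_answer`, `answer_matching_filter`) are hypothetical, and it assumes MATH-style solutions that wrap the final answer in `\boxed{...}`.

```python
import re
from typing import Optional

def extract_answer(solution: str) -> Optional[str]:
    """Pull the final answer out of a solution string, assuming the
    MATH-style convention of wrapping it in \\boxed{...}."""
    match = re.search(r"\\boxed\{([^{}]*)\}", solution)
    return match.group(1).strip() if match else None

def answer_matching_filter(samples):
    """Keep only (question, solution, reference_answer) triples whose
    extracted final answer exactly matches the reference answer."""
    kept = []
    for question, solution, reference in samples:
        predicted = extract_answer(solution)
        if predicted is not None and predicted == reference:
            kept.append((question, solution))
    return kept

# Toy usage: two synthetic solutions for the same question.
samples = [
    ("What is 2 + 3?", r"Adding the terms gives \boxed{5}.", "5"),
    ("What is 2 + 3?", r"Adding the terms gives \boxed{6}.", "5"),  # wrong, dropped
]
print(answer_matching_filter(samples))  # keeps only the correct solution
```

Exact string matching is deliberately crude (it would, for instance, treat 0.5 and 1/2 as different answers), but the abstract's noise-robustness finding suggests such residual errors are tolerable at the data scales involved.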
Concurrent Submissions: ICLR
Submission Number: 14