Large Language Models Do Not Make Complete Use of Math Reasoning Data

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Math Reasoning, Fine-Tuning
Abstract: In deep learning, increasing dataset size has been shown to improve the performance of deep neural networks. However, it is unclear whether these models make complete use of the data they are trained on. Understanding this is especially important in the current large language model era, where data scarcity has become a pressing issue. We discover that when fine-tuning on mathematical reasoning tasks, adding more training data causes the model to incorrectly answer a large portion of test samples it previously answered correctly. This remains true even with popular test-time scaling techniques, which are expected to iron out inconsistencies in model predictions. To better understand this phenomenon, we show both empirically and theoretically that models trained with Supervised Fine-Tuning and Reinforcement Learning are incapable of making complete use of their training data: models trained on the same data learn very different functions across different random seeds, exhibiting high predictive multiplicity. This work offers novel insights that can help improve a model's ability to scale its performance effectively with more data.
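The abstract rests on two measurable quantities: the fraction of previously correct test answers that flip to incorrect after training on more data, and the disagreement between models trained on the same data with different random seeds (predictive multiplicity). The snippet below is a minimal sketch of how such quantities could be computed, not the authors' code; the function names and the toy correctness vectors are hypothetical, and it assumes per-example correctness/prediction vectors have already been extracted from each fine-tuned model.

```python
# Minimal sketch (not the paper's code) of the two quantities the
# abstract describes, assuming per-example results are precomputed.
from itertools import combinations


def forgotten_fraction(correct_small, correct_large):
    """Fraction of test samples answered correctly by the model trained
    on less data but incorrectly after adding more training data."""
    flipped = sum(1 for s, l in zip(correct_small, correct_large) if s and not l)
    was_correct = sum(correct_small)
    return flipped / was_correct if was_correct else 0.0


def predictive_multiplicity(predictions_by_seed):
    """Mean pairwise disagreement rate between models trained on the
    same data with different seeds (one notion of multiplicity)."""
    pairs = list(combinations(predictions_by_seed, 2))
    disagreement = [
        sum(a != b for a, b in zip(p, q)) / len(p) for p, q in pairs
    ]
    return sum(disagreement) / len(disagreement)


# Toy example with hypothetical data:
print(forgotten_fraction([1, 1, 0, 1], [1, 0, 0, 0]))            # 2/3 forgotten
print(predictive_multiplicity([["a", "b"], ["a", "c"], ["d", "b"]]))  # 2/3
```

Under these assumptions, a high forgotten fraction as data grows, or high multiplicity across seeds, would reflect the incomplete data use the abstract claims.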
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3498