Backtracking Mathematical Reasoning of Language Models to the Pretraining Data
Keywords: pretraining data; math reasoning; data attribution
Abstract: In-context learning and chain-of-thought prompting have demonstrated surprising performance improvements on mathematical reasoning benchmarks. Therefore, understanding the underlying factors enabling these capabilities is crucial. However, the specific aspects of pretraining data that equip models with mathematical reasoning capabilities remain largely unexplored and are less studied systematically. In this study, we identify subsets of model pretraining data that contribute to math reasoning ability of the model, and evaluate it on several mathematical operations (e.g. addition, multiplication) and tasks (e.g. the asdiv dataset). We measure the importance of such subsets by continual training of the model on pretraining data subsets, and then we quantify the change in performance on the mathematical benchmark to assess their importance. If a subset results in an improved performance, we conjecture that such subset contributes to a model's overall mathematical ability. Our results unveil that while training on math-only data contributes to simple arithmetic abilities, it does not solely explain performance on more complex reasoning abilities like chain-of-thought reasoning. We also find that code data contributes to chain-of-thought reasoning while reducing the arithmetic performance.
Submission Number: 43