Keywords: Large language model, Math capabilities, Synthetic data, Alignment
TL;DR: Our method effectively increases the scale and diversity of SFT data, which elicits the mathematical capabilities of general pretrained LLMs.
Abstract: It was once believed that mathematical capabilities in language models required either large model scales or extensive math-related pre-training data. However, this paper demonstrates that the small-scale LLaMA-2 7B model already possesses strong mathematical potential. This is evidenced by its impressive scores of 97.6% on the GSM8K benchmark and 70% on the MATH benchmark, achieved by selecting the oracle response from 1024 generations. Equipped with GPT-4 Turbo as an additional verifier, LLaMA-2 7B also achieves 91.8% accuracy on the GSM8K benchmark. This indicates that the primary issue with current models is the difficulty of consistently eliciting their inherent mathematical capabilities. We find that scaling up synthetic SFT data, which proves to be nearly as effective as real data, can significantly enhance the reliability of generating correct answers. Surprisingly, even at approximately one million samples, we observe no clear performance saturation, and our method is more efficient at large data scales than previous works. This approach achieves an accuracy of 82.4% on GSM8K and 40.1% on MATH using the LLaMA-2 7B model, surpassing GPT-3.5 Turbo. Our 70B model even exceeds an early version of GPT-4 on MATH and on the out-of-domain Hungarian National High School Math Exam. These results demonstrate that our method effectively elicits the general mathematical capabilities of language models. We also provide insights into scaling behaviors across different reasoning complexities.
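As a rough illustration of the oracle-selection evaluation mentioned in the abstract (a problem counts as solved if any of N sampled responses matches the reference answer, i.e., pass@N with N = 1024), here is a minimal sketch. The function names, sampling interface, and answer-extraction logic are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of oracle-response selection (pass@N): a problem counts as
# solved if at least one of N sampled completions yields the reference answer.
# `generate_candidates` and `extract_final_answer` are hypothetical stand-ins
# for the model's sampling interface and answer parsing.

from typing import Callable, Iterable, List


def pass_at_n(
    problems: Iterable[dict],
    generate_candidates: Callable[[str, int], List[str]],
    extract_final_answer: Callable[[str], str],
    n: int = 1024,
) -> float:
    """Return the fraction of problems where any of n samples is correct."""
    solved, total = 0, 0
    for problem in problems:
        total += 1
        candidates = generate_candidates(problem["question"], n)
        gold = problem["answer"].strip()
        if any(extract_final_answer(c) == gold for c in candidates):
            solved += 1
    return solved / max(total, 1)
```

Under this reading, the reported 97.6% on GSM8K corresponds to pass@1024 with oracle answer matching, while the 91.8% figure replaces the oracle check with GPT-4 Turbo verification.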
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11934