PSYCHE: Practical Synthetic Math Data Evolution

Published: 2025, Last Modified: 21 Jan 2026. Venue: NLPCC (1) 2025. License: CC BY-SA 4.0
Abstract: Synthetic data has been shown to be effective in enhancing the math capability of LLMs. However, little research has systematically presented a practical synthetic data generation framework for math-specific pre-training, which requires synthetic data at a much larger scale. In this work, we propose a novel Practical synthetic math data evolution (PSYCHE) framework, which is simple, effective, and easy to deploy and evolve in industry. Specifically, PSYCHE emphasizes hardness, accuracy, and diversity in synthetic data generation. It relies on a hardness controller that, during synthetic question generation, focuses on the questions the currently trained LLM cannot answer well, iteratively evolving the LLM with harder math data. To maintain relatively high question diversity and answer accuracy on these harder questions, we propose several strategies and verify their effectiveness: diversity-aware multi-to-multi question generation, multi-lingual data transfer, retrieval-augmented answer generation, and multi-model verification. In total, PSYCHE builds nearly 300 million math samples for math-specific continued pre-training. Extensive experiments confirm that PSYCHE achieves strong results on math benchmarks across various languages and hardness levels.
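The abstract does not specify how the hardness controller is implemented. As a purely illustrative sketch, one way to "focus more on the questions that the currently trained LLM cannot address well" is to sample seed questions with a weight that grows as the model's pass rate falls. The names `hardness_weights` and `sample_seeds`, the `floor` parameter, and the `1 - pass_rate` weighting are all assumptions made for this example, not details taken from PSYCHE.

```python
import random


def hardness_weights(pass_rates, floor=0.05):
    """Map each question's pass rate to a sampling weight.

    pass_rates: dict of question -> fraction of sampled answers judged
    correct for the current model (assumed to be measured elsewhere).
    The weight is 1 - pass_rate, clipped below by `floor` so that easy
    questions are still sampled occasionally (hypothetical choice).
    """
    return {q: max(1.0 - p, floor) for q, p in pass_rates.items()}


def sample_seeds(pass_rates, k, seed=0):
    """Draw k seed questions (with replacement), favoring harder ones."""
    rng = random.Random(seed)
    weights = hardness_weights(pass_rates)
    questions = list(weights)
    return rng.choices(questions, weights=[weights[q] for q in questions], k=k)


if __name__ == "__main__":
    # Toy pass rates for three placeholder questions.
    rates = {"q_easy": 0.95, "q_medium": 0.5, "q_hard": 0.1}
    picked = sample_seeds(rates, k=1000)
    for q in rates:
        print(q, picked.count(q))
```

In an iterative setting, the pass rates would be re-measured after each round of training, so the sampling distribution shifts toward whatever the updated model still finds hard.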