Abstract: In this paper, we propose a new data synthesis method called \textbf{LogicPro}, which leverages LeetCode-style algorithm \underline{Pro}blems and their corresponding \underline{Pro}gram solutions to synthesize Complex \underline{Logic}al Reasoning data in text format.
First, we synthesize complex reasoning problems from source algorithm problems and their test cases.
Then, standard answers and intermediate variable outputs are obtained for each problem by running standard Python solutions on the test cases.
Finally, guided by the intermediate variables of the code, we synthesize the text reasoning process for each reasoning problem.
Through this method, we can synthesize data that is difficult, scalable, and effective, and that comes with gold-standard answers and high-quality reasoning processes.
As a result, with our 540K synthesized dataset constructed solely from 2,360 algorithm problems, our approach achieves significant improvements across multiple models on \textit{BBH$^{27}$}, \textit{LogicBench}, \textit{DROP}, \textit{AR-LSAT}, \textit{GSM8K}, and other datasets, outperforming a wide range of existing reasoning datasets.
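The second step above (obtaining a standard answer together with intermediate variable outputs) can be illustrated with a minimal sketch. It assumes a toy LeetCode-style solution and uses Python's built-in `sys.settrace` to snapshot local variables line by line. The names `two_sum` and `run_with_trace` are hypothetical and chosen only for illustration; the actual pipeline may instrument solutions differently, for example by inserting print statements.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple


def two_sum(nums: List[int], target: int) -> List[int]:
    """Hypothetical stand-in for a standard solution to an algorithm problem."""
    seen = {}
    for i, x in enumerate(nums):
        need = target - x
        if need in seen:
            return [seen[need], i]
        seen[x] = i
    return []


def run_with_trace(
    func: Callable[..., Any], *args: Any
) -> Tuple[Any, List[Tuple[int, Dict[str, str]]]]:
    """Run func(*args); return its answer and per-line snapshots of its locals.

    Each snapshot is (line offset within func, {variable name: repr(value)}).
    Values are stored as repr strings so later mutation cannot alter them.
    """
    steps: List[Tuple[int, Dict[str, str]]] = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the solution itself.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            snapshot = {k: repr(v) for k, v in frame.f_locals.items()}
            steps.append((offset, snapshot))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        answer = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return answer, steps


if __name__ == "__main__":
    answer, steps = run_with_trace(two_sum, [2, 7, 11, 15], 9)
    print("answer:", answer)
    for offset, local_vars in steps:
        print(offset, local_vars)
```

In this sketch, the returned `answer` plays the role of the standard answer for one test case, and `steps` is the kind of intermediate-variable record that could guide the synthesis of a text reasoning process in the third step.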
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: reasoning, NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6969