Learning to Reason by Failing: Offline RL on Sub-optimal Rollouts Scales Synthetic Data by 8x

Amrith Setlur; Saurabh Garg; Xinyang Geng; Naman Garg; Virginia Smith; Aviral Kumar

Learning to Reason by Failing: Offline RL on Sub-optimal Rollouts Scales Synthetic Data by 8x

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, Aviral Kumar

Published: 13 Jun 2024, Last Modified: 13 Jun 2024ICML 2024 Workshop AI4MATH PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: synthetic data, large language models, math reasoning, reinforcement learning

Abstract: Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this for reasoning problems via an empirical study, followed by a theoretical formalization. First, we find that while the typical approach of finetuning a model on synthetic correct or *positive* problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner **doubles** the sample efficiency of synthetic data. At the same time, training on model-generated positives can amplify spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize *negative* responses, \ie model-generated responses that are deemed incorrect via final answer checking. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or credit of each intermediate step in the negative response. With this \emph{per-step} scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by **8x**. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits benefits of RL over imitating positive data alone.

Submission Number: 12

Loading