Keywords: LLM, reasoning, self-improvement
TL;DR: We examine the promises and pitfalls of self-improvement with LLMs in mathematical reasoning domains
Abstract: Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training, the process by which a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. In a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that even this basic approach improves not only the model's reasoning performance but also its ability to generate higher-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms that enable prolonged self-improvement.
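To make the self-feedback mechanism concrete, below is a minimal sketch (not the authors' code) of how majority voting can produce a pseudo-reward: the most common final answer among a model's own samples serves as a pseudo-label, and each sample is rewarded for agreeing with it. The function name and 0/1 reward scheme are illustrative assumptions.

```python
from collections import Counter
from typing import List

def majority_vote_pseudo_rewards(answers: List[str]) -> List[float]:
    """Return a 0/1 pseudo-reward per sampled answer.

    `answers` are final answers extracted from N sampled completions for the
    same prompt (answer extraction is task-specific and omitted here).
    """
    counts = Counter(answers)
    # Majority answer acts as the pseudo-ground-truth label.
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# Example: 5 sampled answers for one math problem
rewards = majority_vote_pseudo_rewards(["42", "42", "41", "42", "7"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

These pseudo-rewards could then stand in for ground-truth rewards in an RL update, which is what makes reward hacking possible once the model learns to produce self-consistent but incorrect answers.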
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13833