Keywords: LLM, reasoning, self-improvement
TL;DR: We examine the promises and pitfalls of self-improvement with LLMs in mathematical reasoning domains
Abstract: Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training, the process by which a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. In a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that even this basic approach improves not only the model's reasoning performance but also its ability to generate higher-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms that enable prolonged self-improvement.
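To make the self-feedback mechanism concrete, below is a minimal sketch (not the authors' code) of how majority voting can produce a pseudo-reward: the most common final answer among a model's own samples serves as a pseudo-label, and each sample is rewarded for agreeing with it. The function name and 0/1 reward scheme are illustrative assumptions.

```python
from collections import Counter
from typing import List

def majority_vote_pseudo_rewards(answers: List[str]) -> List[float]:
    """Return a 0/1 pseudo-reward per sampled answer.

    `answers` are final answers extracted from N sampled completions for the
    same prompt (answer extraction is task-specific and omitted here).
    """
    counts = Counter(answers)
    # Majority answer acts as the pseudo-ground-truth label.
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# Example: 5 sampled answers for one math problem
rewards = majority_vote_pseudo_rewards(["42", "42", "41", "42", "7"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

These pseudo-rewards could then stand in for ground-truth rewards in an RL update, which is what makes reward hacking possible once the model learns to produce self-consistent but incorrect answers.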
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13833