What Can You Do When You Have Zero Rewards During RL?

20 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: zero rewards, data centric RL, exploration
TL;DR: Under zero outcome rewards on a simple graph search task, recently proposed methods fail. A simple data-centric intervention works surprisingly well.
Abstract: Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. However, its success often depends on the base model occasionally sampling correct solutions. When no correct solutions are sampled, training encounters a zero-reward barrier where learning stalls due to zero gradients. We study this scenario through the graph search task and evaluate recent methods that incorporate desirable components such as dense rewards, diversity incentives, and improved credit assignment. Our experiments show that none of these approaches overcome the zero-reward barrier if the base model never produces a correct answer. In contrast, we find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward. Importantly, this succeeds without modifying the RL algorithm itself. Because official implementations of several baselines were unavailable, we developed our own, which allowed us to conduct a detailed analysis of their failure modes. We release these implementations to support further research: [https://github.com/anon-zero-rewards/zero-rewards-rl](https://github.com/anon-zero-rewards/zero-rewards-rl).
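A minimal sketch of the zero-reward barrier the abstract describes, assuming a GRPO-style group-normalized advantage (a common choice in recent LLM-RL recipes; not necessarily the exact objective used in the paper). When no sampled completion for a prompt is correct, every outcome reward in the group is zero, so the normalized advantages, and hence the policy gradient, vanish; mixing in an easier prompt that is occasionally solved restores a nonzero signal.

```python
# Hedged illustration of the zero-reward barrier under outcome-only rewards
# with group-normalized (GRPO-style) advantages. Function names are ours.
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hard prompt: the base model never samples a correct answer,
# so all outcome rewards in the group are 0 and the gradient signal is 0.
hard_group = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(hard_group))  # -> [0. 0. 0. 0.]

# Easier prompt the model can occasionally solve: at least one nonzero
# reward yields nonzero advantages, so learning can proceed.
easy_group = [1.0, 0.0, 0.0, 0.0]
print(group_advantages(easy_group))  # -> nonzero advantages
```

This mirrors the paper's data-centric intervention at the batch level: adding easier samples is what introduces nonzero advantages without changing the RL algorithm itself.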
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22362