Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision.
Abstract: Process and outcome supervision represent two fundamental approaches to reinforcement learning, especially for complex reasoning tasks in large language models. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we provide a possible theoretical resolution to this debate. Perhaps surprisingly, our main theorem shows that: *under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision*. At the core of this result lies the novel *Change of Trajectory Measure Lemma*---a powerful technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a simple yet powerful connection between outcome and process supervision. These findings suggest that the empirically observed performance gap between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data and algorithm design for reinforcement learning.
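
The claim that any policy's advantage function can serve as an optimal process reward model admits a short worked sketch. The notation below (outcome reward $R$, value $V^{\pi}$, advantage $A^{\pi}$, and the name $r_{\mathrm{proc}}$) is ours for illustration and does not reproduce the paper's formal statement or its conditions:

```latex
% Minimal sketch of the advantage-as-process-reward connection, in our own
% notation; the paper's formal result and assumptions may differ.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
Let $R$ be the outcome reward assigned to a complete trajectory (e.g., by a
verifier), and let $\pi$ be any policy with rollout access. For a partial
trajectory ending in state $s_t$ and a candidate step $a_t$, define
\begin{align*}
  V^{\pi}(s_t)      &= \mathbb{E}\!\left[ R \mid s_t;\ \text{continue with } \pi \right],\\
  Q^{\pi}(s_t, a_t) &= \mathbb{E}\!\left[ R \mid s_t, a_t;\ \text{continue with } \pi \right],\\
  r_{\mathrm{proc}}(s_t, a_t) &:= A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t).
\end{align*}
Both expectations can be estimated by Monte Carlo rollouts scored with the
outcome verifier, so a step-level (process) reward signal is recoverable from
outcome supervision plus rollout capability alone.
\end{document}
```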
Lay Summary: Process supervision and outcome supervision are two key types of training data for reasoning tasks. Our findings show that the statistical difficulty of learning from these two data types is not significantly different. This suggests that models trained using outcome supervision data, which is often easier to obtain, can perform just as well as those trained with fine-grained process supervision data.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement Learning Theory, Process Supervision, Outcome Supervision, Reward Modeling, Markov Decision Process
Submission Number: 12753