VinePPO: Refining Credit Assignment in RL Training of LLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without them, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines on the MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
Lay Summary: Large language models (LLMs) like ChatGPT learn complex tasks through trial-and-error training called reinforcement learning. During that process, the model may generate many reasoning steps before producing an answer, but only a few of those steps actually matter. Figuring out which steps are truly helpful, known as the "credit assignment" problem, is crucial, yet standard methods (such as PPO or GRPO) either treat all steps equally or rely on a helper model that often guesses incorrectly. We found that this helper, called the value network, frequently fails to recognize which reasoning steps contribute to success. This may explain why recent simplified approaches that ignore step-by-step evaluation still perform surprisingly well. Instead of discarding credit assignment, we propose VinePPO, which makes it accurate by directly measuring the usefulness of each step through re-simulation. Since language models can easily restart from any intermediate point by re-feeding the context, VinePPO uses this property to get reliable, unbiased feedback without training a separate value network. VinePPO outperforms PPO and shortcut methods on math reasoning tasks, achieving better accuracy with less total training time. Our results show that smarter credit assignment can still drive better LLMs.
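To make the re-simulation idea concrete, below is a minimal sketch of Monte Carlo value estimation at intermediate reasoning steps. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): `sample_completion` and `reward_fn` are hypothetical placeholders for a policy-LLM sampler and a verifiable-reward checker, and the response is assumed to be pre-split into steps.

```python
# Sketch of Monte Carlo value/advantage estimation from intermediate reasoning steps.
# Assumptions: `sample_completion` draws one rollout from the policy given a text prefix,
# and `reward_fn` scores a full response (e.g. 1.0 if the final answer is correct, else 0.0).
from statistics import mean
from typing import Callable, List


def mc_value(
    prefix: str,
    sample_completion: Callable[[str], str],
    reward_fn: Callable[[str], float],
    num_rollouts: int = 8,
) -> float:
    """Estimate V(prefix) by averaging rewards of rollouts sampled from the prefix."""
    returns = [reward_fn(prefix + sample_completion(prefix)) for _ in range(num_rollouts)]
    return mean(returns)


def step_advantages(
    prompt: str,
    steps: List[str],
    sample_completion: Callable[[str], str],
    reward_fn: Callable[[str], float],
    num_rollouts: int = 8,
) -> List[float]:
    """Per-step advantage A_t ≈ V(s_{t+1}) - V(s_t): the change in estimated
    success probability caused by taking reasoning step t (reward only at the end,
    discount factor 1). A real implementation would score the final, complete
    response directly instead of re-sampling from it."""
    prefixes = [prompt]
    for step in steps:
        prefixes.append(prefixes[-1] + step)
    values = [mc_value(p, sample_completion, reward_fn, num_rollouts) for p in prefixes]
    return [values[t + 1] - values[t] for t in range(len(steps))]
```

These estimates are unbiased because they average real rollouts of the current policy from each intermediate state, which is exactly the property that a learned value network struggles to match on reasoning-heavy tasks.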
Link To Code: https://github.com/McGill-NLP/VinePPO
Primary Area: Deep Learning->Large Language Models
Keywords: RL for LLM, Verifiable Rewards, Credit Assignment
Submission Number: 7155