CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: large language models, reinforcement learning, credit assignment
TL;DR: We propose CAPO, a simple and efficient method that improves LLM reasoning by using an LLM as a Generative Process Reward Model (LLM-as-GenPRM) to provide verifiable, fine-grained credit assignment.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat a whole response as a single action, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment through value estimation, but yield inaccurate and unverifiable signals due to limited sampling. Methods using Process Reward Models, on the other hand, can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, their feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method: Credit Assignment Policy Optimization (CAPO). CAPO avoids the complexities of prior approaches. Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, based on the correctness of each step, providing deterministic token-level credits that refine the tokens originally assigned identical rule-based rewards. This design choice not only simplifies the training pipeline but also enhances its generality, as our experiments show it works effectively with various powerful, widely accessible open-source models. The fine-grained feedback enables a crucial shift from purely outcome-oriented to process-oriented learning; our analysis of this dynamic leads to a reward structure that balances both objectives.
To further enhance accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones, including Llama and Qwen models, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO helps the model learn correct reasoning pathways that lead to correct answers.
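The critique-voting and credit-refinement idea described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the boolean per-step verdicts, and the fixed penalty value are all assumptions. The sketch majority-votes step verdicts across several GenPRM critiques, then overrides the uniform rule-based outcome reward with a penalty on tokens belonging to steps voted incorrect.

```python
from collections import Counter

def majority_vote(step_verdicts):
    """Majority-vote per-step correctness across multiple critiques.
    step_verdicts: list of critiques, each a list of bools (one per step)."""
    num_steps = len(step_verdicts[0])
    return [
        Counter(critique[i] for critique in step_verdicts).most_common(1)[0][0]
        for i in range(num_steps)
    ]

def token_level_credits(step_spans, voted, outcome_reward, penalty=-1.0):
    """Refine a uniform rule-based reward into token-level credits.
    step_spans: (start, end) token index ranges, one per reasoning step.
    Tokens in steps the voted critique marks incorrect get `penalty`
    (an assumed scheme) instead of the shared outcome reward."""
    num_tokens = max(end for _, end in step_spans)
    credits = [outcome_reward] * num_tokens
    for (start, end), step_ok in zip(step_spans, voted):
        if not step_ok:
            for t in range(start, end):
                credits[t] = penalty
    return credits

# Example: three critiques of a three-step response; step 2 is voted incorrect.
voted = majority_vote([[True, False, True],
                       [True, True,  True],
                       [True, False, False]])
credits = token_level_credits([(0, 4), (4, 8), (8, 12)], voted, outcome_reward=1.0)
```

Here `voted` is `[True, False, True]`, so tokens 4–7 receive the penalty while the rest keep the outcome reward, turning one scalar response-level reward into a token-level signal.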
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23868