Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · License: CC BY 4.0
Abstract: Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. Existing remedies fall short: process supervision relies on learned reward models that require costly training data and suffer from reward misalignment, while outcome supervision fails on complex tasks that need coordinated intermediate steps. We introduce **O**utcome **R**efining **P**rocess **S**upervision (ORPS), which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates with a self-critique mechanism that integrates runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains: **26.9%** higher correctness and **42.2%** better code efficiency. The results demonstrate that ORPS helps LLMs escape local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
Lay Summary: Large language models can write code, but they struggle with complex programming tasks that require careful step-by-step reasoning, much like a student who can memorize formulas yet stumbles on multi-step word problems. Current methods either only check whether the final code works (missing opportunities to improve it) or require expensive training of separate AI systems to guide the reasoning process. We propose ORPS, which guides AI reasoning by actually running code at each step and using the results to explore different solution strategies, like having a programming tutor who tests your code as you write it. Instead of following a single path, our system maintains multiple solution attempts simultaneously and learns from execution results to identify better algorithms. This approach improved code generation success by 27% and made solutions 42% more efficient, without requiring expensive training of guidance systems. Remarkably, smaller AI models using our method outperformed larger ones, suggesting that good reasoning matters more than raw model size for complex programming tasks.
Link To Code: https://github.com/zhuohaoyu/ORPS
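A minimal sketch of the tree-structured search loop described in the abstract, under stated assumptions: the helper names are hypothetical, `propose_candidates` stands in for the LLM proposal step, and `score_candidate` is a simplified numeric stand-in for the paper's self-critique scoring. See the repository above for the actual implementation.

```python
import heapq
import os
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Candidate:
    score: float
    code: str = field(compare=False)
    feedback: str = field(compare=False)


def run_tests(code: str, tests: list[tuple[str, str]], timeout: float = 5.0):
    """Execute a candidate program on (stdin, expected_stdout) pairs;
    return its pass rate and mean wall-clock time per test."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    passed, elapsed = 0, 0.0
    try:
        for stdin_data, expected in tests:
            start = time.perf_counter()
            try:
                out = subprocess.run([sys.executable, path], input=stdin_data,
                                     capture_output=True, text=True, timeout=timeout)
                elapsed += time.perf_counter() - start
                passed += out.stdout.strip() == expected.strip()
            except subprocess.TimeoutExpired:
                elapsed += timeout
    finally:
        os.unlink(path)
    return passed / len(tests), elapsed / len(tests)


def propose_candidates(problem: str, parent_code: str, feedback: str, n: int) -> list[str]:
    """Hypothetical LLM call: propose n refined programs given the parent
    candidate and its execution feedback. Wire this to your model of choice."""
    raise NotImplementedError


def score_candidate(pass_rate: float, runtime: float) -> float:
    """Simplified stand-in for the paper's self-critique scoring (assumption):
    reward correctness first, then faster execution."""
    return pass_rate - 0.01 * runtime


def orps_style_search(problem: str, tests, beam_width: int = 3,
                      expansions: int = 4, steps: int = 5) -> str:
    """Tree-structured search: expand each beam entry into several alternatives,
    verify them by execution, and keep the top scorers for the next round."""
    beam = [Candidate(score=0.0, code="", feedback="initial attempt")]
    for _ in range(steps):
        children = []
        for parent in beam:
            for code in propose_candidates(problem, parent.code, parent.feedback, expansions):
                pass_rate, runtime = run_tests(code, tests)
                feedback = f"pass_rate={pass_rate:.2f}, mean_runtime={runtime:.3f}s"
                children.append(Candidate(score_candidate(pass_rate, runtime), code, feedback))
        # Executable verification steers the beam past plausible-looking
        # but incorrect or slow programs.
        beam = heapq.nlargest(beam_width, children)
    return max(beam).code
```

In the paper the candidate score combines reasoning-level self-critique with the execution profile; the heuristic above only approximates that signal to keep the sketch self-contained.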
Primary Area: Deep Learning->Large Language Models
Keywords: Code Generation, Process Supervision, Reasoning, Reward Models, Inference-Time Scaling, Large Language Models
Submission Number: 9782