Keywords: Reward Models, Reinforcement Learning, Code LLMs
Abstract: Test-time scaling improves the code generation capability of LLMs by leveraging a reward model to identify the best solution among multiple candidates. However, coding tasks span diverse domains, making unified evaluation challenging. In this paper, we present RewardCode, a generalist reward model for coding tasks. RewardCode performs principle-guided scoring, generates executable unit tests, and conducts pointwise evaluation of solutions, enabling scalable and fine-grained assessment. To train a cross-task code reward model, we construct CodePair-19K, a dataset of verifiable code preference pairs with task summaries and executable unit tests. Furthermore, we carefully design a two-stage training pipeline for RewardCode. The first stage combines Structural Summarize Fine-Tuning and Group Rejective Fine-Tuning, in which diverse task descriptions are distilled into structured summaries to improve cross-domain code understanding, and high-quality trajectories are bootstrapped through group rejection sampling from LLMs. The second stage introduces Pairwise-GRPO, a reinforcement learning method that leverages preference pairs to enhance the model's ability to distinguish between solutions while ensuring the generation of consistent and verifiable unit tests. Experiments on multiple benchmarks show that RewardCode outperforms strong baselines in accuracy and task success, demonstrating its effectiveness in advancing general-purpose Code LLMs.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5693
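As a minimal illustrative sketch (not the authors' released code), the pointwise, unit-test-based evaluation described in the abstract could look roughly like the following: a candidate solution is paired with each generated test, executed in isolation, and scored by its pass rate. The function name `pointwise_score`, the pass-rate scoring rule, and the per-test subprocess execution are all assumptions for illustration.

```python
import os
import subprocess
import sys
import tempfile

def pointwise_score(solution_code: str, unit_tests: list[str], timeout: float = 5.0) -> float:
    """Score a candidate solution by the fraction of generated unit tests it passes.

    NOTE: illustrative sketch only; the scoring rule and execution strategy
    are assumptions, not the RewardCode implementation.
    """
    if not unit_tests:
        return 0.0
    passed = 0
    for test in unit_tests:
        # Concatenate the candidate solution with one generated test case.
        program = solution_code + "\n\n" + test
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout,
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # A hanging candidate counts as a failed test.
        finally:
            os.unlink(path)
    return passed / len(unit_tests)

# Example usage with a toy candidate and two hypothetical generated tests.
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(pointwise_score(candidate, tests))  # 1.0 if both tests pass
```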