Intra-Trajectory Consistency for Reward Modeling

ICLR 2026 Conference Submission 6830 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reward model, RLHF, NLP
Abstract: Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) and inference-time verification. Due to the prohibitive cost of fine-grained annotations, current reward models typically learn from holistic response scores to determine outcome rewards. However, this coarse-grained supervision makes it difficult for the reward model to identify which specific components within a response trajectory truly correlate with the final score, leading to poor generalization to unseen responses. In this paper, we introduce an intra-trajectory consistency regularization that propagates coarse, response-level supervision into fine-grained learning signals. Inspired by a Bayesian framework, our method enforces a simple principle: the rewards of adjacent generation processes should be more consistent when the connecting token has a higher generation probability. We apply the proposed regularization to an advanced outcome reward model, improving its performance on RewardBench. Furthermore, we demonstrate that the reward model trained with the proposed regularization yields better DPO-aligned policies and achieves superior best-of-N inference-time verification results. Our implementation code is provided in the supplementary material.
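The following is a minimal sketch of how the stated principle could be turned into a regularization term, assuming the reward model exposes a per-prefix reward along the response trajectory and the generating policy's token log-probabilities are available. The function name `intra_trajectory_consistency_loss`, the tensor shapes, and the squared-difference penalty are illustrative assumptions, not the paper's exact Bayesian-derived objective.

```python
import torch


def intra_trajectory_consistency_loss(per_token_rewards: torch.Tensor,
                                       token_logprobs: torch.Tensor) -> torch.Tensor:
    """Hypothetical intra-trajectory consistency regularizer.

    Args:
        per_token_rewards: (batch, seq_len) reward assigned to each prefix of
            the response trajectory by the reward model.
        token_logprobs: (batch, seq_len - 1) log-probability, under the
            generating policy, of the token connecting prefix t to prefix t+1.

    The gap between rewards of adjacent prefixes is penalized in proportion to
    the connecting token's generation probability: high-probability
    continuations are pushed toward consistent rewards, while unlikely tokens
    are left free to shift the reward.
    """
    # Reward difference between adjacent generation processes (prefixes).
    reward_gap = per_token_rewards[:, 1:] - per_token_rewards[:, :-1]
    # Weight each gap by p(token_{t+1} | prefix_t).
    weights = token_logprobs.exp()
    return (weights * reward_gap.pow(2)).mean()
```

In training, a term like this would typically be added, with a weighting coefficient, to the standard outcome-level reward-modeling loss (e.g., a Bradley-Terry preference loss on holistic response scores), so that the coarse response-level supervision is propagated into finer-grained signals along the trajectory.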
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6830