Keywords: Language Models; Reward Models; Temporal Difference; LLM RL; LLM Inference
Abstract: Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) during training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance in both Best-of-$N$ (up to 6.6\%) and tree-search (up to 23.7\%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL --- achieving with just 2.5k training examples performance comparable to what baseline methods require 50.1k examples to attain --- and yield higher-quality language model policies across 8 model variants (5 series), namely Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). All code is available at https://anonymous.4open.science/r/TDRM-CDD6.
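The abstract describes training a process reward model by minimizing temporal differences over reasoning steps. The sketch below is a minimal, hypothetical illustration of a one-step TD objective for a step-level value/reward model; the function name, shapes, and the choice of a squared TD error are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def td_reward_loss(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Illustrative one-step TD loss for a step-level reward model.

    values:  shape (T,) -- model's predicted value at each reasoning step
    rewards: shape (T,) -- per-step reward; in verifiable-reward setups this
             is often zero everywhere except the final, checkable step
    """
    # Bootstrap target r_t + gamma * V(s_{t+1}); the value after the
    # terminal step is taken to be 0.
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    targets = rewards + gamma * next_values
    # Stop gradients through the bootstrap target, as is standard in TD learning.
    td_errors = values - targets.detach()
    return (td_errors ** 2).mean()

# Example: three reasoning steps, reward only at the verified final step.
v = torch.tensor([0.2, 0.5, 0.9])
r = torch.tensor([0.0, 0.0, 1.0])
loss = td_reward_loss(v, r)
```

Penalizing the TD error pushes adjacent step values toward consistency with each other, which is one way to obtain the "smoother" reward signal the abstract refers to.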
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5415