Reward Learning through Ranking Mean Squared Error

Chaitanya Kharyal; Calarina Muslimani; Matthew E. Taylor

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
Venue: RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025)
License: CC BY 4.0

Keywords: Reinforcement learning, Reward Learning, Learning from feedback

Abstract: Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce rMSE, a new rating-based RL method that treats human-provided ratings as ordinal targets. Our approach learns from an offline dataset of trajectory–rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad", "neutral", "good"). At each training step, we sample one trajectory per rating class, compute their predicted returns under the learned reward model, and rank them using a differentiable sorting operator (i.e., soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the human ratings. Additionally, we incorporate a conservative regularization term to reduce overestimation on out-of-training-distribution actions. Through experiments with simulated feedback, we show that rMSE outperforms a recently proposed rating-based RL method in the Hungry-Thirsty and Lunar Lander domains. Additionally, rMSE learns reward functions that are better aligned with the simulated ratings.
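The training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: ratings are encoded as the integers 1 ("bad"), 2 ("neutral"), 3 ("good"); a trajectory is a list of scalar state features; the reward model is a plain Python callable; and the soft ranks come from a simple pairwise-sigmoid relaxation, which may differ from the differentiable sorting operator used in the paper. The conservative regularization term is omitted because the abstract does not specify its form. All function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_ranks(values, tau=1.0):
    """Smooth ascending ranks: rank_i = 1 + sum_{j != i} sigmoid((v_i - v_j) / tau).

    Assumption: a pairwise-sigmoid relaxation stands in for the paper's
    differentiable sorting operator. As tau -> 0 this approaches hard ranks.
    """
    n = len(values)
    return [
        1.0 + sum(sigmoid((values[i] - values[j]) / tau) for j in range(n) if j != i)
        for i in range(n)
    ]


def predicted_return(trajectory, reward_fn):
    """Sum of predicted per-step rewards along one trajectory."""
    return sum(reward_fn(state) for state in trajectory)


def rmse_loss(trajectories, ratings, reward_fn, tau=1.0):
    """Mean squared error between soft ranks of predicted returns and ratings."""
    returns = [predicted_return(t, reward_fn) for t in trajectories]
    ranks = soft_ranks(returns, tau)
    return sum((r - y) ** 2 for r, y in zip(ranks, ratings)) / len(ratings)


# One sampled trajectory per rating class (toy data, assumed for illustration).
batch_trajectories = [[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]]
batch_ratings = [1, 2, 3]  # assumed encoding: bad=1, neutral=2, good=3


def aligned_reward(state):
    # Hypothetical reward model that orders the toy trajectories like the ratings.
    return 10.0 * state


def misaligned_reward(state):
    # Hypothetical reward model that orders them in reverse.
    return -10.0 * state


aligned_loss = rmse_loss(batch_trajectories, batch_ratings, aligned_reward)
misaligned_loss = rmse_loss(batch_trajectories, batch_ratings, misaligned_reward)
```

In this toy setup, a reward model whose predicted returns order the trajectories the same way as the ratings yields a loss near zero, while the reversed model is penalized. In practice the reward model would be a neural network and the loss would be minimized by gradient descent through the soft-rank operator.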

Submission Number: 3
