Which Rewards Matter? Reward Selection for Reinforcement Learning from Limited Feedback

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
Venue: RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025)
License: CC BY 4.0
Keywords: Reward Selection, Reinforcement Learning, Learning from Limited Feedback
TL;DR: A formalization and empirical study of reward selection in reinforcement learning from limited feedback.
Abstract: The effectiveness of reinforcement learning algorithms is fundamentally determined by the reward feedback they receive during training. However, in practical settings, obtaining large quantities of reward feedback is often infeasible due to computational or financial constraints, particularly when relying on human feedback. When reinforcement learning must proceed with limited feedback—labeling rewards for only a fraction of samples—a fundamental question arises: *which* samples should be labeled to maximize policy performance? We formalize this *reward selection* problem for reinforcement learning from limited feedback (RLLF), introducing a general problem setup to enable the study of different selection strategies. Our investigation proceeds in two parts, evaluating the efficacy of (i) simple heuristics that prioritize high-frequency or high-value states, and (ii) learned selection strategies, trained in advance to identify impactful samples for labeling. These strategies tend to select rewards that (1) guide the agent along optimal trajectories, and (2) support recovery toward near-optimal behavior after deviations. Optimal selection methods yield near-optimal policies with significantly fewer labeled rewards than full supervision, highlighting reward selection as a powerful paradigm for scaling reinforcement learning in feedback-limited settings.
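The paper's exact selection procedure is not reproduced here, but as a rough illustration of the high-frequency heuristic mentioned in the abstract, the following Python sketch picks which transitions to label with reward under a fixed feedback budget. The function names, the zero-reward placeholder for unlabeled samples, and the assumption of hashable (e.g., tabular or discretized) states are all illustrative choices, not the authors' method.

```python
def select_rewards_by_frequency(transitions, budget):
    """Choose which transitions to label with reward, given a labeling budget.

    `transitions` is a list of (state, action, next_state) tuples collected
    from rollouts; states are assumed hashable (e.g., discretized).
    Returns the indices of the `budget` transitions whose states are visited
    most frequently.
    """
    # Count how often each state appears across the collected rollouts.
    counts = {}
    for state, _, _ in transitions:
        counts[state] = counts.get(state, 0) + 1

    # Rank transitions by the visitation frequency of their state and keep
    # the top `budget` for labeling; everything else remains unlabeled.
    order = sorted(range(len(transitions)),
                   key=lambda i: counts[transitions[i][0]],
                   reverse=True)
    return set(order[:budget])


def label_rewards(transitions, labeled_idx, reward_fn, default=0.0):
    """Return one reward per transition: the true reward where selected for
    labeling, otherwise a placeholder (zero here, purely for illustration)."""
    return [reward_fn(s, a, s2) if i in labeled_idx else default
            for i, (s, a, s2) in enumerate(transitions)]
```

Under this sketch, the downstream RL algorithm trains on the mixed labeled/placeholder rewards; a value-based heuristic would simply replace the frequency counts with a state-value estimate when ranking candidates.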
Submission Number: 4