Keywords: Reward Selection, Learning from Limited Feedback
Abstract: In reinforcement learning from limited feedback (RLLF), only a small fraction of an offline dataset can be labeled with rewards, and the central question is which samples to label so that a strong policy can be learned from the resulting partially labeled dataset. Prior work formalized this as a reward-selection problem, focusing on the selection stage while treating downstream policy learning as a black box, in a regime where queried rewards are not retained for reward-model training. We instead study the retained-label setting, where queried rewards can be stored and used to fit a reward model before policy learning. We bound the suboptimality of the learned policy by two sources of error: one from offline RL on the fixed dataset, and one from reward-model uncertainty. Since reward selection cannot change the offline dataset, the limited labeling budget must be spent strategically on reducing reward uncertainty. Motivated by RLLF's observation that useful rewards tend to keep the agent on high-return trajectories, we propose successor-guided uncertainty reduction (SURE), which uses successor features to select samples from which high-value states are reachable and whose labels reduce reward uncertainty. Theoretically, we derive SURE from a bound-induced design objective and characterize its exact one-step marginal gain. Empirically, SURE reaches near full-feedback performance with few reward labels across a variety of domains, yielding a strong method for feedback-efficient reinforcement learning.
Submission Number: 146