Keywords: human-aware-ai, preference learning, reinforcement learning
TL;DR: The proposed approach improves on state-of-the-art performance for goal-oriented preferences, using an action distance measure derived from the learned policy as an auxiliary prediction task for reward learning.
Abstract: Preference-based Reinforcement Learning (PbRL) with binary preference feedback over trajectory pairs has proven effective at learning the complex preferences of a human in the loop in domains with high-dimensional state and action spaces. While the human preference is primarily inferred from the feedback provided, we propose that, when the preferences are goal-oriented, the policy learned jointly with the reward model during training can also provide a valuable learning signal about the probable goal implied by that preference. To exploit this information, we introduce an action distance measure based on the policy and use it as an auxiliary prediction task for reward learning. This measure not only provides insight into the transition dynamics of the environment but also indicates the reachability of states under the policy by supplying a distance-to-goal estimate. We evaluate the performance and sample efficiency of our approach on six tasks with goal-oriented preferences from the Meta-World benchmark. We show that our approach outperforms methods that augment PbRL baselines with auxiliary tasks of learning environment dynamics or a non-temporal distance measure. Additionally, our experimental results confirm that the action distance measure can also accelerate policy learning.
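To make the auxiliary-task idea concrete, the following is a minimal sketch (not the authors' implementation, whose details are not given in the abstract) of a reward model with a shared trunk, a reward head trained on Bradley-Terry preference loss, and an auxiliary head that regresses a policy-derived action distance between state pairs. All names, shapes, and the loss weighting `lambda_aux` are illustrative assumptions.

```python
# Hypothetical sketch: reward learning with an auxiliary action-distance head.
# Assumption: state pairs (obs_a, obs_b) are sampled from policy rollouts and the
# number of policy steps between them serves as the action-distance target.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardWithAuxHead(nn.Module):
    """Shared trunk with a reward head and an auxiliary action-distance head."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.reward_head = nn.Linear(hidden, 1)
        # Predicts the number of policy steps needed to reach obs_b from obs_a.
        self.dist_head = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def reward(self, obs, act):
        return self.reward_head(self.trunk(torch.cat([obs, act], dim=-1)))

    def action_distance(self, obs_a, obs_b):
        return self.dist_head(torch.cat([obs_a, obs_b], dim=-1))


def preference_loss(model, seg0, seg1, label):
    """Bradley-Terry cross-entropy over summed segment rewards.

    seg0/seg1: tuples of (obs, act) with shapes (batch, T, obs_dim) and
    (batch, T, act_dim); label is a LongTensor in {0, 1} marking the preferred segment.
    """
    r0 = model.reward(*seg0).sum(dim=1)  # (batch, 1)
    r1 = model.reward(*seg1).sum(dim=1)  # (batch, 1)
    logits = torch.cat([r0, r1], dim=-1)  # (batch, 2)
    return F.cross_entropy(logits, label)


def aux_distance_loss(model, obs_a, obs_b, step_gap):
    """Regress the policy-derived action distance between sampled state pairs."""
    pred = model.action_distance(obs_a, obs_b).squeeze(-1)
    return F.mse_loss(pred, step_gap)


def reward_update(model, optimizer, pref_batch, dist_batch, lambda_aux=0.5):
    """Joint update: preference loss plus a weighted auxiliary distance term."""
    seg0, seg1, label = pref_batch
    obs_a, obs_b, step_gap = dist_batch
    loss = preference_loss(model, seg0, seg1, label) \
        + lambda_aux * aux_distance_loss(model, obs_a, obs_b, step_gap)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the auxiliary head shares no parameters with the reward head beyond the input features; whether the trunk, the heads, or the distance targets are shared or normalized differently is a design choice the abstract does not specify.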
Submission Number: 55