Hindsight PRIORs for Reward Learning from Human Preferences

Mudit Verma; Katherine Metcalf

Published: 16 Jan 2024, Last Modified: 16 Mar 2024, ICLR 2024 poster

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: preference based reinforcement learning, world models, return redistribution

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Presents a method that addresses the credit assignment problem in preference-based reinforcement learning by guiding rewards to key states according to relative state importance.

Abstract: Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning one from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior contributed most to a preference, which results in data-intensive approaches and subpar reward models. We address these limitations by introducing a credit assignment strategy (PRIOR) that uses a forward dynamics world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, PRIOR achieves an 80% success rate with half the amount of data compared to baselines. The performance gains and our ablations demonstrate the benefits that even a simple credit assignment strategy can have on reward learning, and that state importance in forward dynamics prediction is a strong proxy for a state's contribution to a preference decision.
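
The auxiliary objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the squared-error form of the redistribution term, the Bradley-Terry preference loss, and the weighting factor `lam` are all assumptions made for illustration, and the per-state importance scores are treated as given inputs rather than derived from an actual world model.

```python
import math

def redistribution_loss(rewards, importance):
    """Squared error between each predicted reward and its importance-weighted
    share of the predicted return (assumed form of the auxiliary objective)."""
    total_importance = sum(importance)
    weights = [w / total_importance for w in importance]  # normalize to sum to 1
    predicted_return = sum(rewards)
    targets = [w * predicted_return for w in weights]
    return sum((r - t) ** 2 for r, t in zip(rewards, targets)) / len(rewards)

def preference_loss(rewards_a, rewards_b, label):
    """Bradley-Terry cross-entropy over segment returns (assumed, standard in PbRL).
    label = 1.0 means segment A was preferred, 0.0 means segment B."""
    ret_a, ret_b = sum(rewards_a), sum(rewards_b)
    m = max(ret_a, ret_b)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(math.exp(ret_a - m) + math.exp(ret_b - m))
    return -(label * (ret_a - log_norm) + (1.0 - label) * (ret_b - log_norm))

def total_loss(seg_a, seg_b, label, lam=1.0):
    """Preference loss plus lam times the redistribution term for both segments.
    Each segment is a (predicted_rewards, importance_scores) pair; lam is hypothetical."""
    (rewards_a, imp_a), (rewards_b, imp_b) = seg_a, seg_b
    aux = redistribution_loss(rewards_a, imp_a) + redistribution_loss(rewards_b, imp_b)
    return preference_loss(rewards_a, rewards_b, label) + lam * aux
```

Under these assumptions, the redistribution term is zero exactly when every predicted reward equals its importance-weighted share of the segment's predicted return, which is one way to read "rewards proportional to state importance".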

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: reinforcement learning

Submission Number: 7430
