Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Katherine Metcalf; Miguel Sarabia; Natalie Mackraz; Barry-John Theobald

Published: 30 Aug 2023, Last Modified: 20 Apr 2025, CoRL 2023 Poster, Readers: Everyone

Keywords: human-in-the-loop learning, preference-based RL, RLHF

TL;DR: We incorporate state-action transition dynamics into a reward function learned from trajectory preferences and find that we can match baseline performance with an order of magnitude fewer preferences.

Abstract: Preference-based reinforcement learning (PbRL) aligns a robot's behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground truth reward policy performance versus only 38% and 21% without environment dynamics. The performance gains demonstrate that _explicitly encoding environment dynamics improves preference-learned reward functions_.
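
The two alternating objectives in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear `tanh` encoder, the negative cosine similarity used for temporal consistency, and the Bradley-Terry preference model are assumptions chosen for illustration, and every function and parameter name is hypothetical.

```python
import numpy as np


def encode_state_action(states, actions, W_enc):
    """Hypothetical encoder: z^{sa} = tanh([s, a] @ W_enc), one row per time step."""
    return np.tanh(np.concatenate([states, actions], axis=-1) @ W_enc)


def temporal_consistency_loss(z_sa, next_state_targets, W_pred):
    """Assumed self-supervised objective: predict the next-state embedding
    from z^{sa} and score it with negative cosine similarity (range [-1, 1])."""
    pred = z_sa @ W_pred
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    tgt = next_state_targets / (
        np.linalg.norm(next_state_targets, axis=-1, keepdims=True) + 1e-8
    )
    return float(-np.mean(np.sum(pred * tgt, axis=-1)))


def preference_loss(z_a, z_b, w_reward, label):
    """Assumed Bradley-Terry cross-entropy over two trajectory segments.

    The reward is a linear head on z^{sa}; label is 1 if segment A is
    preferred and 0 if segment B is preferred.
    """
    ret_a = np.sum(z_a @ w_reward)  # summed predicted reward of segment A
    ret_b = np.sum(z_b @ w_reward)  # summed predicted reward of segment B
    log_norm = np.logaddexp(ret_a, ret_b)  # numerically stable log(e^a + e^b)
    log_p_a = ret_a - log_norm
    log_p_b = ret_b - log_norm
    return float(-(label * log_p_a + (1 - label) * log_p_b))
```

In a training loop, the sketch would alternate gradient steps on `temporal_consistency_loss` (using unlabeled transitions) with steps on `preference_loss` (using the small set of preference labels), so that the reward head is trained on a representation that already encodes the dynamics.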

Student First Author: no

Instructions: I have read the instructions for authors (https://corl2023.org/instructions-for-authors/)
