Keywords: preference-based reinforcement learning, offline reinforcement learning, deep reinforcement learning, optimal transport
TL;DR: An offline preference-based reinforcement learning algorithm that transfers preferences across tasks using optimal transport.
Abstract: In preference-based Reinforcement Learning (PbRL), aligning rewards with human intentions often requires a substantial volume of human-provided labels. Moreover, the expensive preference data collected for prior tasks is often not reusable for subsequent tasks, leading to repetitive labeling for each new task. In this paper, we propose a novel zero-shot cross-task preference-based RL algorithm that leverages labeled preference data from source tasks to infer labels for target tasks, eliminating the need for human queries. Our approach uses the Gromov-Wasserstein distance to align trajectory distributions between the source and target tasks. The resulting optimal transport matrix establishes a correspondence between trajectories of the two tasks, making it possible to identify corresponding trajectory pairs and transfer the preference labels. However, learning directly from these inferred labels may yield noisy or inaccurate reward functions. To address this, we introduce the Robust Preference Transformer, which models rewards as Gaussian distributions to capture both the reward mean and its uncertainty. Through extensive empirical validation on robotic manipulation tasks from Meta-World and Robomimic, we show that our approach transfers preferences between tasks in a zero-shot manner and learns reward functions robustly from noisy labels. Notably, it significantly surpasses existing methods in limited-data scenarios. Videos of our method are available at: https://sites.google.com/view/pot-rpt.
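As a rough illustration of the label-transfer step described in the abstract, the sketch below uses the POT library to compute a Gromov-Wasserstein coupling between source and target trajectory sets and to carry source preference labels over to matched target pairs. This is not the authors' implementation: the fixed-length trajectory featurization, the `transfer_preferences` helper, and the uniform marginals are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): zero-shot preference transfer via
# Gromov-Wasserstein optimal transport, using the POT library (pip install pot).
# Assumes each trajectory is summarized as a fixed-length feature vector.
import numpy as np
import ot


def transfer_preferences(src_feats, tgt_feats, src_labels):
    """src_feats: (n_src, d_src) trajectory features from the source task.
    tgt_feats: (n_tgt, d_tgt) trajectory features from the target task.
    src_labels: dict mapping a source index pair (i, j) to a preference
        label in {0, 1} (1 means trajectory i is preferred over j).
    Returns inferred labels for the matched target trajectory pairs."""
    # Intra-task pairwise distance matrices. GW compares the two metric
    # structures, so source and target features need not share a dimension.
    C_src = ot.dist(src_feats, src_feats)
    C_tgt = ot.dist(tgt_feats, tgt_feats)
    C_src /= C_src.max() + 1e-8
    C_tgt /= C_tgt.max() + 1e-8

    # Uniform marginals over the trajectories of each task.
    p = ot.unif(len(src_feats))
    q = ot.unif(len(tgt_feats))

    # Optimal transport coupling between the two trajectory distributions.
    T = ot.gromov.gromov_wasserstein(C_src, C_tgt, p, q, loss_fun="square_loss")

    # Match each source trajectory to its highest-mass target counterpart.
    match = T.argmax(axis=1)

    # Transfer each labeled source pair to the corresponding target pair.
    return {(int(match[i]), int(match[j])): y for (i, j), y in src_labels.items()}
```

The inferred target labels would then supervise reward learning; the paper's Robust Preference Transformer additionally predicts a reward variance alongside the mean so that uncertain, possibly mislabeled pairs contribute less to training.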
Submission Number: 51