Keywords: Machine Learning, ICML, Reinforcement Learning, Preference-based RL, Continuous Control, Gaussian Process, Hallucinated Inputs, Optimistic Exploration
TL;DR: HIP-RL is a novel Preference-based Reinforcement Learning (PbRL) algorithm for continuous domains. We provide both convergence guarantees and experimental evaluation of HIP-RL.
Abstract: Preference-based Reinforcement Learning (PbRL) enables agents to learn policies from preferences between trajectories rather than from explicit reward functions. Previous approaches to PbRL are either empirical and successfully used in real-world applications but lack theoretical understanding, or they come with strong theoretical guarantees that hold only in tabular settings.
In this work, we propose a novel practical PbRL algorithm for continuous domains, called Hallucinated Inputs Preference-based RL (HIP-RL), which bridges this gap between theory and practice. HIP-RL maintains a parametrized set of transition models and uses hallucinated inputs to perform optimistic exploration in continuous state-action spaces by controlling the epistemic uncertainty. We derive regret bounds for HIP-RL and show that they are sublinear for Gaussian Process dynamics and reward models. Moreover, we experimentally demonstrate the effectiveness of HIP-RL.
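To give intuition for the mechanism described above, the sketch below illustrates (under our own assumptions, not the authors' implementation) how a hallucinated input can turn a learned Gaussian Process dynamics model into an optimistic one: an auxiliary input eta in [-1, 1], treated as an extra control the planner optimizes, selects any next state inside the model's confidence interval, so exploring with eta steers rollouts toward regions of high epistemic uncertainty. The names `eta`, `beta`, and the 1-D state/action setup are illustrative.

```python
# Minimal sketch of hallucinated-input optimism with a GP dynamics model.
# Not the HIP-RL implementation; names and the toy dynamics are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP dynamics model on a few (state, action) -> next_state transitions.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))          # columns: state, action
y = 0.9 * X[:, 0] + 0.5 * np.sin(3.0 * X[:, 1])   # "unknown" true dynamics
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3).fit(X, y)

def hallucinated_step(state, action, eta, beta=2.0):
    """Optimistic next state: posterior mean shifted within +/- beta * std by eta."""
    x = np.array([[state, action]])
    mean, std = gp.predict(x, return_std=True)
    return float(mean[0] + beta * std[0] * np.clip(eta, -1.0, 1.0))

# eta acts as an extra "action" the planner controls, so maximizing imagined
# return over (action, eta) yields optimistic exploration of uncertain states.
print(hallucinated_step(state=0.2, action=-0.4, eta=+1.0))  # optimistic transition
print(hallucinated_step(state=0.2, action=-0.4, eta=0.0))   # mean prediction
```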
Submission Number: 39