Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

Published: 01 Jan 2024, Last Modified: 26 Jan 2025 · AAMAS 2024 · CC BY-SA 4.0
Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL approaches allow agents to learn reward functions directly from human feedback. Despite recent successes, many HitL RL methods still require numerous human interactions to learn successful reward functions. To address this, this work introduces Sub-optimal Data Pre-training (SDP), a method that leverages reward-free, sub-optimal data to improve the feedback efficiency of HitL RL algorithms. We demonstrate that SDP can significantly outperform state-of-the-art HitL RL algorithms in three DMControl environments.
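A minimal sketch of the general idea described in the abstract: using reward-free, sub-optimal transitions to warm-start a reward model before any human feedback is collected. This is an illustrative assumption, not the authors' exact implementation; the class `RewardModel`, the helper `pretrain_on_suboptimal_data`, and the choice of a zero pseudo-label are all hypothetical.

```python
# Illustrative sketch only: pre-train a reward predictor on reward-free,
# sub-optimal transitions (assumed pseudo-label of 0), so that subsequent
# human feedback can be spent on distinguishing better behaviour.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Simple state-action reward predictor, as used in preference-based HitL RL."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def pretrain_on_suboptimal_data(model: RewardModel,
                                obs: torch.Tensor,
                                act: torch.Tensor,
                                epochs: int = 10,
                                lr: float = 3e-4) -> None:
    """Regress the reward model onto a low pseudo-reward (here 0) for the
    reward-free, sub-optimal transitions. The pseudo-label value is an
    assumption made for this sketch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    targets = torch.zeros(obs.shape[0], 1)  # assumed pseudo-label for sub-optimal data
    for _ in range(epochs):
        pred = model(obs, act)
        loss = nn.functional.mse_loss(pred, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After such a pre-training step, the reward model would be fine-tuned with whatever human-feedback mechanism the underlying HitL RL algorithm uses (e.g., pairwise preferences); the details of that stage are not specified in the abstract.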