Keywords: Preference-based Reinforcement Learning, Human-in-the-Loop RL, Evolutionary learning, Partially Observable
TL;DR: We present a new method for preference-based Reinforcement Learning with a genetic algorithm that learns a return model (instead of a reward model), which is highly beneficial for partially observable environments.
Abstract: A significant challenge in reinforcement learning is how to accurately convey our desires to the artificial agent.
Preference-based reinforcement learning uses human preferences between concrete examples of the agent's behavior to model the reward or return function that the human intends.
However, existing models discard much of the available information and are therefore less accurate than they could be, especially if the environment is only partially observable.
To overcome this limitation, the model presented in this thesis combines all available information (all observations made during an episode) through a temporal convolutional network to model the return function, rather than a reward function, from preferences. The reinforcement learning, implemented with a genetic algorithm, is then guided by this model.
We show that our method is a viable way to apply preference-based reinforcement learning in partially observable environments.
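To make the idea concrete, below is a minimal sketch (not the authors' code) of a return model that maps a whole episode of observations to a scalar predicted return via temporal (1-D) convolutions and is trained with a standard Bradley-Terry preference loss over episode pairs; the class names, layer sizes, and toy data are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReturnModel(nn.Module):
    """Predicts the return of a whole episode from its observation sequence
    using temporal (1-D) convolutions, global pooling, and a linear head."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(obs_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=4, dilation=2),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim); Conv1d expects (batch, channels, time)
        h = self.conv(obs_seq.transpose(1, 2))
        h = h.mean(dim=-1)                # pool over the time dimension
        return self.head(h).squeeze(-1)   # one predicted return per episode


def preference_loss(model, ep_a, ep_b, prefer_a):
    """Bradley-Terry loss: the episode the human preferred should receive
    the higher predicted return. prefer_a is 1.0 where episode A was preferred."""
    logits = model(ep_a) - model(ep_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)


# Toy training step on random data (batch of 8 preference pairs,
# episodes of 100 steps with 12-dimensional observations).
model = ReturnModel(obs_dim=12)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
ep_a, ep_b = torch.randn(8, 100, 12), torch.randn(8, 100, 12)
prefer_a = torch.ones(8)  # pretend the human preferred episode A every time
optim.zero_grad()
preference_loss(model, ep_a, ep_b, prefer_a).backward()
optim.step()
```

In this setup, the genetic algorithm would use the model's predicted return of an episode as the fitness of the policy that produced it, so no per-step reward model is needed.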