Exploration Driven by an Optimistic Bellman Equation

Samuele Tosatto, Carlo D'Eramo, Joni Pajarinen, Jan Peters

12 May 2021, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Exploring high-dimensional state spaces and finding sparse rewards are central problems in reinforcement learning. Exploration strategies are frequently either naïve (e.g., simplistic epsilon-greedy or Boltzmann policies), intractable (i.e., full Bayesian treatment of reinforcement learning), or heavily reliant on heuristics. The lack of a tractable but principled exploration approach unnecessarily complicates the application of reinforcement learning to a broader range of problems. Efficient exploration can be accomplished by relying on the uncertainty of the state-action value function. To obtain the uncertainty, we maintain an ensemble of value function estimates and present an optimistic Bellman equation (OBE) for such ensembles. This OBE is derived from a relative entropy maximization principle and yields an implicit exploration bonus, resulting in improved exploration during action selection. The implied exploration bonus can be seen as a well-principled type of intrinsic motivation and exhibits favorable theoretical properties. OBE can be applied to a wide range of algorithms. We propose two algorithms as applications of the principle, Optimistic Q-learning and Optimistic DQN, which outperform comparison methods on standard benchmarks.
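
As an illustration of how an ensemble of value estimates can produce an implicit exploration bonus, the following is a minimal sketch and not the authors' implementation. The abstract does not state the form of the OBE, so the sketch assumes a log-mean-exp aggregation over ensemble members, a common optimistic operator associated with relative-entropy arguments. The function name `optimistic_value`, the temperature `eta`, and the example values are hypothetical.

```python
import numpy as np


def optimistic_value(q_ensemble, eta=1.0):
    """Aggregate an ensemble of Q-estimates optimistically.

    Assumed operator (not taken from the abstract):
        (1 / eta) * log(mean_m exp(eta * Q_m))
    q_ensemble has shape (n_members, n_actions); eta must be > 0.
    """
    q = np.asarray(q_ensemble, dtype=float)
    # Subtract the per-action maximum for numerical stability.
    q_max = q.max(axis=0)
    return q_max + np.log(np.mean(np.exp(eta * (q - q_max)), axis=0)) / eta


# Two ensemble members, two actions. The members agree on action 0
# and disagree on action 1, although both actions have the same mean.
q_ensemble = np.array([[1.0, 0.0],
                       [1.0, 2.0]])

mean_q = q_ensemble.mean(axis=0)
opt_q = optimistic_value(q_ensemble, eta=1.0)

# The gap between the optimistic and the mean estimate acts as a bonus
# that is zero where the ensemble agrees and positive where it disagrees.
bonus = opt_q - mean_q
greedy_action = int(np.argmax(opt_q))
```

Under this assumed operator, the aggregated value lies between the ensemble mean and the ensemble maximum, so greedy action selection on it favors actions whose value is still uncertain. As `eta` approaches zero the value approaches the ensemble mean and the bonus vanishes.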
