- Keywords: exploration, reinforcement learning
- Abstract: Randomized least-squares value iteration (RLSVI) is a provably efficient exploration method. However, it is limited to the case where 1) a good feature is known in advance and 2) this feature is fixed during training; otherwise, RLSVI incurs a prohibitive computational cost to obtain posterior samples of the parameter of the $Q$-value function. In this work, we present a practical algorithm named HyperDQN that addresses these two issues in the context of deep reinforcement learning, where the feature changes over iterations. HyperDQN is built on two parametric models: in addition to a non-linear neural network (i.e., the base model) that predicts $Q$-values, our method employs a probabilistic hypermodel (i.e., the meta model), which outputs the parameter of the base model. When both models are jointly optimized under a specifically designed objective, three goals are achieved. First, the hypermodel generates approximate posterior samples of the parameter of the $Q$-value function. As a result, diverse $Q$-value functions are sampled to select exploratory action sequences. This retains the key idea of RLSVI for efficient exploration. Second, a good feature is learned to approximate $Q$-value functions. This addresses limitation 1. Third, posterior samples of the $Q$-value function can be obtained more efficiently than with existing methods, and the changing feature does not affect this efficiency. This addresses limitation 2. On the Atari 2600 suite, after $20$M samples, HyperDQN achieves about a $2\times$ improvement over (double) DQN, the advanced method Bootstrapped DQN, and the state-of-the-art exploration-bonus method OB2I. On another challenging benchmark, SuperMarioBros, HyperDQN outperforms baselines on $7$ out of $9$ games.
- One-sentence Summary: We design a practical randomized exploration method to address the sample efficiency issue in online reinforcement learning.
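
The two-model structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear form of the hypermodel, the choice to generate only the last-layer weights, the `tanh` feature map, and all sizes and names (`features`, `sample_q_weights`, `NOISE_DIM`, and so on) are assumptions made for illustration, and the joint training objective is omitted entirely. The sketch only shows how sampling noise through a hypermodel yields diverse $Q$-value functions, each of which can be followed greedily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
STATE_DIM, FEATURE_DIM, NUM_ACTIONS, NOISE_DIM = 4, 8, 3, 5

# Base model: a feature map phi(s). A single tanh layer stands in for a
# deep network; in practice these weights would be learned.
W_feat = rng.normal(scale=0.5, size=(STATE_DIM, FEATURE_DIM))

def features(state):
    """Return the feature vector phi(s) of shape (FEATURE_DIM,)."""
    return np.tanh(state @ W_feat)

# Hypermodel (assumed linear here): maps noise z ~ N(0, I) to the weights
# of the final Q-value layer, theta = z A + b. Untrained in this sketch.
A = rng.normal(scale=0.1, size=(NOISE_DIM, FEATURE_DIM * NUM_ACTIONS))
b = np.zeros(FEATURE_DIM * NUM_ACTIONS)

def sample_q_weights(generator):
    """Draw one approximate posterior sample of the last-layer weights."""
    z = generator.standard_normal(NOISE_DIM)
    return (z @ A + b).reshape(FEATURE_DIM, NUM_ACTIONS)

def q_values(state, theta):
    """Q(s, .) = phi(s)^T theta, one value per action."""
    return features(state) @ theta

def act(state, theta):
    """Act greedily with respect to the sampled Q-value function."""
    return int(np.argmax(q_values(state, theta)))

# Different noise draws give different Q-value functions for the same state.
state = rng.normal(size=STATE_DIM)
thetas = [sample_q_weights(rng) for _ in range(5)]
sampled_q = np.stack([q_values(state, theta) for theta in thetas])
actions = [act(state, theta) for theta in thetas]
```

In an RLSVI-style agent, one such sample would typically be drawn at the start of an episode and held fixed while acting, so that exploration is consistent across a whole action sequence rather than varying step by step.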