Variational oracle guiding for reinforcement learning

Dongqi Han; Tadashi Kozuno; Xufang Luo; Zhao-Yun Chen; Kenji Doya; Yuqing Yang; Dongsheng Li

Variational oracle guiding for reinforcement learning

Dongqi Han, Tadashi Kozuno, Xufang Luo, Zhao-Yun Chen, Kenji Doya, Yuqing Yang, Dongsheng Li

Published: 28 Jan 2022, Last Modified: 13 Feb 2023ICLR 2022 PosterReaders: Everyone

Keywords: variational Bayes, oracle guiding, reinforcement learning, decision making, probabilistic modeling, game, Mahjong

Abstract: How to make intelligent decisions is a central problem in machine learning and artificial intelligence. Despite recent successes of deep reinforcement learning (RL) in various decision making problems, an important but under-explored aspect is how to leverage oracle observation (the information that is invisible during online decision making, but is available during offline training) to facilitate learning. For example, human experts will look at the replay after a Poker game, in which they can check the opponents' hands to improve their estimation of the opponents' hands from the visible information during playing. In this work, we study such problems based on Bayesian theory and derive an objective to leverage oracle observation in RL using variational methods. Our key contribution is to propose a general learning framework referred to as variational latent oracle guiding (VLOG) for DRL. VLOG is featured with preferable properties such as its robust and promising performance and its versatility to incorporate with any value-based DRL algorithm. We empirically demonstrate the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to a challenging tile-based game Mahjong. Furthermore, we publish the Mahjong environment and an offline RL dataset as a benchmark to facilitate future research on oracle guiding (https://github.com/Agony5757/mahjong).

One-sentence Summary: We propose a variational Bayes framework leveraging oracle (hindsight) information available in training to improve deep reinforcement learning.

Supplementary Material: zip

30 Replies

Loading