Keywords: Reinforcement learning, Posterior, Model-based reinforcement learning
TL;DR: We proposed an improved posterior factorization for PSRL under approximate inference; and two sampling strategies.
Abstract: Model-based reinforcement learning algorithms (MBRL) hold tremendous promise for improving the sample efficiency in online RL. However, many existing popular MBRL algorithms cannot deal with exploration and exploitation properly. Posterior sampling reinforcement learning (PSRL) serves as a promising approach for automatically trading off the exploration and exploitation, but the theoretical guarantees only hold under exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be fairly suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of polices by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass the baseline methods by a significant margin on both dense rewards and sparse rewards tasks from DM control suite, OpenAI Gym and Metaworld benchmarks.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)