Mastering Pixel-Based Reinforcement Learning via Positive Unlabeled Policy-Guided Contrast

15 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Pixel Observation, Reinforcement Learning, Self-Supervised Learning, Contrastive Learning, Visual Control Task
TL;DR: To find the optimal policy in pixel-observation environments, we propose a contrastive learning framework that uses only positive views and anchor views to contrast policy-guided representations, and we carry out extensive experiments.
Abstract: Real-world reinforcement learning has recently received significant attention. A fundamental yet challenging problem in this learning paradigm is perceiving real-world environmental information, which gives rise to \textit{pixel-based} reinforcement learning: learning representations from visual observations for policy optimization. In this article, we examine the frameworks of benchmark methods and demonstrate a long-standing \textit{paradox} challenging current methods: depending on the training phase, exploring visual semantic information can either improve the learned feature representations or prevent them from improving further. In practice, we further show that an over-redundancy issue generally halts the growth of sample efficiency among baseline methods. To remedy this deficiency of existing methods, we introduce a novel plug-and-play method for pixel-based reinforcement learning. Our model employs \textit{positive unlabeled policy-guided contrast} to jointly learn anti-redundant and policy-optimization-relevant visual semantic information during training. To elucidate the proposed method's innate advantages, we revisit the pixel-based reinforcement learning paradigm from an information-theoretic perspective. The theoretical analysis proves that the proposed model achieves a tighter lower bound on the mutual information between the policy-optimization-related information and the representation derived by the encoder. To evaluate our model, we conduct extensive benchmark experiments and demonstrate the superior performance of our method over existing methods in pixel-observation environments.
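The abstract describes contrasting only anchor views and positive views of policy-guided representations, without explicit negatives. Below is a minimal sketch of such a positive-only alignment objective; the function name, arguments, and the cosine-similarity form are illustrative assumptions for exposition, not the paper's actual loss.

```python
# Hypothetical sketch (not the paper's exact objective): a positive-only
# contrastive loss that aligns an anchor view with its positive view in
# representation space, with no explicit negative samples.
import torch
import torch.nn.functional as F

def positive_only_contrastive_loss(anchor_feats: torch.Tensor,
                                   positive_feats: torch.Tensor,
                                   temperature: float = 0.1) -> torch.Tensor:
    """Align anchor and positive embeddings produced by the pixel encoder.

    anchor_feats, positive_feats: (batch, dim) encoder outputs from two
    augmented views of the same pixel observation.
    """
    a = F.normalize(anchor_feats, dim=-1)
    p = F.normalize(positive_feats, dim=-1)
    # Maximize temperature-scaled cosine similarity of matched pairs.
    return -(a * p).sum(dim=-1).div(temperature).mean()

# Usage example with random features standing in for encoder outputs.
if __name__ == "__main__":
    anchor = torch.randn(32, 128)
    positive = torch.randn(32, 128)
    print(positive_only_contrastive_loss(anchor, positive).item())
```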
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 37