Self-Supervised Continuous Control without Policy Gradient

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission
Keywords: Self-Supervised Reinforcement Learning, Continuous Control, Zeroth-Order Optimization
Abstract: Despite the remarkable progress made by policy gradient algorithms in reinforcement learning (RL), sub-optimal policies often result from the local exploration property of the policy gradient update. In this work, we propose a method called Zeroth-Order Supervised Policy Improvement (ZOSPI) that exploits the estimated value function $Q$ globally while preserving the local exploitation of policy gradient methods. Experiments show that ZOSPI achieves competitive results on the MuJoCo benchmarks with remarkable sample efficiency. Moreover, unlike conventional policy gradient methods, the policy learning of ZOSPI is conducted in a self-supervised manner. We show that such a self-supervised learning paradigm offers the flexibility to incorporate optimistic exploration as well as to adopt a non-parametric policy.
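The abstract describes replacing the policy gradient with a supervised regression onto high-$Q$ actions found by zeroth-order (sampling-based) search. Below is a minimal sketch of one such policy update, assuming a PyTorch setup; the function name `zospi_policy_update`, the sample counts, the perturbation scale `sigma`, and the `q_net(state, action)` signature are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a zeroth-order supervised policy update: sample candidate actions
# globally over the action space and locally around the current policy, pick the
# candidate with the highest estimated Q, and regress the policy onto it.
# All names and hyperparameters here are assumptions for illustration.
import torch

def zospi_policy_update(policy, q_net, states, action_low, action_high,
                        optimizer, n_global=16, n_local=16, sigma=0.1):
    batch, act_dim = states.shape[0], action_low.shape[0]
    with torch.no_grad():
        # Global candidates: uniform samples over the action space (global use of Q).
        global_a = torch.rand(batch, n_global, act_dim) * (action_high - action_low) + action_low
        # Local candidates: Gaussian perturbations around the current policy output.
        mu = policy(states).unsqueeze(1)
        local_a = (mu + sigma * torch.randn(batch, n_local, act_dim)).clamp(
            min=action_low.min().item(), max=action_high.max().item())
        candidates = torch.cat([global_a, local_a], dim=1)  # (batch, n_candidates, act_dim)
        # Evaluate every candidate with the learned Q function.
        s_rep = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_vals = q_net(s_rep.reshape(-1, states.shape[1]),
                       candidates.reshape(-1, act_dim)).reshape(batch, -1)
        # Supervised targets: the argmax-Q candidate for each state.
        targets = candidates[torch.arange(batch), q_vals.argmax(dim=1)]
    # Self-supervised regression loss in place of a policy gradient.
    loss = torch.nn.functional.mse_loss(policy(states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```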
One-sentence Summary: We propose a self-supervised approach for continuous control tasks without policy gradients.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=gnh4I_2hH0