Keywords: policy optimization, uncertainty
Abstract: Variational policy optimization has become increasingly attractive to the reinforcement learning community because of its strong capability in uncertainty modeling and environment generalization. However, almost all existing studies in this area rely on the Kullback–Leibler (KL) divergence, which is unfortunately ill-defined in several situations. In addition, the policy is parameterized and optimized in weight space, which may not only introduce unnecessary bias but also make policy learning harder due to the complex dependencies in the weight posterior. In this paper, we design a novel functional Wasserstein variational policy optimization (FWVPO) algorithm based on the Wasserstein distance between function distributions. Specifically, we first parameterize the policy as a Bayesian neural network, but from a function-space view rather than a weight-space view, and then propose FWVPO to optimize and explore the functional policy posterior. We prove that FWVPO is a valid variational Bayesian objective and also guarantees monotonic improvement of the expected reward under certain conditions. Experimental results on multiple reinforcement learning tasks demonstrate the effectiveness of our new algorithm in terms of both cumulative rewards and uncertainty-modeling capability.
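The abstract does not spell out how the Wasserstein distance between function distributions is computed; as a rough, non-authoritative illustration of the kind of quantity involved, the sketch below evaluates the closed-form 2-Wasserstein distance between two Gaussian approximations of a policy's function values at a small set of states. The Gaussian assumption, the measurement-point setup, and all names are illustrative and not taken from the paper.

```python
# Minimal sketch (assumption: function-space posteriors approximated as Gaussians
# over function values at a finite set of measurement states; not the authors' code).
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(mean1, cov1, mean2, cov2):
    """Squared 2-Wasserstein distance between N(mean1, cov1) and N(mean2, cov2)."""
    sqrt_cov2 = sqrtm(cov2)
    cross = sqrtm(sqrt_cov2 @ cov1 @ sqrt_cov2)          # (S2^{1/2} S1 S2^{1/2})^{1/2}
    return float(np.sum((mean1 - mean2) ** 2)
                 + np.trace(cov1 + cov2 - 2.0 * np.real(cross)))

# Hypothetical example: two posteriors over function values at 5 states.
rng = np.random.default_rng(0)
m1, m2 = rng.normal(size=5), rng.normal(size=5)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
S1 = A @ A.T + 1e-3 * np.eye(5)                          # positive-definite covariances
S2 = B @ B.T + 1e-3 * np.eye(5)
print("W2^2 between function-space Gaussians:", gaussian_w2_squared(m1, S1, m2, S2))
```

Unlike the KL divergence, this distance remains well-defined even when the two distributions have disjoint supports, which is the property the abstract appeals to.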
Supplementary Material: zip
List Of Authors: Xuan, Junyu and Wu, Mengjing and Liu, Zihe and Lu, Jie
Latex Source Code: zip
Signed License Agreement: pdf
Submission Number: 193