Represent Your Own Policies: Reinforcement Learning with Policy-extended Value Function Approximator

Hongyao Tang; Zhaopeng Meng; Jianye HAO; Chen Chen; Daniel Graves; Dong Li; Hangyu Mao; Wulong Liu; Yaodong Yang; LI Wang

21 May 2021 (modified: 05 May 2023) | NeurIPS 2021 Submitted | Readers: Everyone

Keywords: Reinforcement Learning

Abstract: We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. This extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing characteristic, namely value generalization among policies. We formally analyze this value generalization under Generalized Policy Iteration (GPI). From both theoretical and empirical perspectives, we show that the generalized value estimates offered by PeVFA may have lower initial approximation error with respect to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on these findings, we introduce a new form of GPI with PeVFA that leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of the value generalization offered by PeVFA and of policy representation learning in several OpenAI Gym continuous control tasks. As a representative algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement over its vanilla counterpart in most environments.
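
The core idea in the abstract, a value approximator that also takes a policy representation as input, can be illustrated with a minimal toy sketch. This is not the authors' implementation: the class name `LinearPeVFA`, the linear form, the hand-picked scalar policy embeddings, and the regression targets are all assumptions made for illustration, whereas the paper learns embeddings from policy network parameters or state-action pairs.

```python
import random


class LinearPeVFA:
    """Toy policy-extended value approximator (hypothetical, linear).

    V(s, chi) = w . [s, chi, 1], where chi is a policy embedding.
    A conventional VFA would use only [s, 1] as features.
    """

    def __init__(self, state_dim, embed_dim, lr=0.05, seed=0):
        rng = random.Random(seed)
        n = state_dim + embed_dim + 1  # +1 for the bias feature
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(n)]
        self.lr = lr

    def _features(self, state, policy_embedding):
        return list(state) + list(policy_embedding) + [1.0]

    def value(self, state, policy_embedding):
        x = self._features(state, policy_embedding)
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, state, policy_embedding, target):
        # One stochastic gradient step on the squared error to the target.
        x = self._features(state, policy_embedding)
        err = target - self.value(state, policy_embedding)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        return err


# Assumed toy data: one state, two known policies with made-up values.
state = [1.0]
known = [([0.0], 1.0), ([1.0], 3.0)]  # (policy embedding, value target)

model = LinearPeVFA(state_dim=1, embed_dim=1)
for _ in range(2000):
    for chi, target in known:
        model.update(state, chi, target)

# Query an unseen policy embedding: the estimate is a generalized value
# that could serve as a starting point for evaluating that new policy.
generalized = model.value(state, [2.0])
```

A single set of weights stores the values of both known policies, and querying the unseen embedding `[2.0]` returns an extrapolated estimate. In this linear toy the extrapolation is exact by construction; the abstract claims only that such generalized estimates may have lower initial approximation error for successive policies.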

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

TL;DR: We propose a new form of GPI with the Policy-extended Value Function Approximator (PeVFA) and study value generalization among policies along the policy improvement path, with the help of self-supervised policy representation learning.

Supplementary Material: pdf
