Improving Deep Policy Gradients with Value Function Search

Enrico Marchesini; Christopher Amato

Published: 01 Feb 2023, Last Modified: 20 Feb 2023, ICLR 2023 poster, Readers: Everyone

Keywords: Deep Reinforcement Learning, Deep Policy Gradients

Abstract: Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, limiting the variance reduction efficacy and leading policies to sub-optimal performance. This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show that improving Deep PG primitives results in improved sample efficiency and policies with higher returns using common continuous control benchmark domains.
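The search described in the abstract can be illustrated with a toy sketch. The abstract does not specify how the value networks are perturbed or how a better approximation is selected, so the following assumes Gaussian noise on the parameters and selection by mean squared error on a batch of already-collected returns. A linear value function stands in for a network, and all names (`mse`, `value_function_search`, `pop_size`, `sigma`) are hypothetical. It is not the authors' implementation.

```python
import random


def mse(weights, states, returns):
    """Mean squared error of a linear value function V(s) = w . s against returns."""
    total = 0.0
    for s, g in zip(states, returns):
        v = sum(w * x for w, x in zip(weights, s))
        total += (v - g) ** 2
    return total / len(states)


def value_function_search(weights, states, returns, pop_size=10, sigma=0.05, rng=None):
    """Return the lowest-error candidate among the current weights and perturbed copies.

    Assumption: candidates are Gaussian perturbations of the current weights,
    scored on stored data only (no new environment steps, no gradients).
    """
    if rng is None:
        rng = random.Random(0)
    best_w, best_err = list(weights), mse(weights, states, returns)
    for _ in range(pop_size):
        cand = [w + rng.gauss(0.0, sigma) for w in weights]
        err = mse(cand, states, returns)
        if err < best_err:  # keep a candidate only if it fits the returns better
            best_w, best_err = cand, err
    return best_w, best_err


# Toy batch: returns generated by a known linear function of the state.
rng = random.Random(1)
states = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
returns = [2.0 * s[0] - 1.0 * s[1] for s in states]

w0 = [0.0, 0.0]
err0 = mse(w0, states, returns)
w1, err1 = value_function_search(w0, states, returns, rng=rng)
```

Because the current weights are always kept as a candidate, the selected error can never exceed the starting error. In this sketch the search step reuses stored data, which is consistent with the abstract's claim that no additional environment interactions or gradient computations are required.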

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)

TL;DR: We present a Value Function Search that employs a gradient-free population of perturbed value networks to improve Deep Policy Gradient primitives, leading to higher returns and better sample efficiency.
