Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY 4.0
TL;DR: We demonstrate through both theoretical analysis and experiments that vanilla policy gradient can achieve performance comparable to PPO by simply increasing the number of value steps per iteration.
Abstract: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate this, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, vanilla policy gradient with this simple change is significantly more robust to hyperparameter choices, suggesting that RL algorithms can still be made more effective and easier to use.
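To make the described change concrete, here is a minimal PyTorch sketch of a vanilla policy gradient iteration in which the number of value update steps is an explicit hyperparameter. This is an illustrative reconstruction, not the authors' implementation: names such as `PolicyNet`, `ValueNet`, `vpg_update`, and `value_steps` are assumptions, and the batch of returns and advantages is presumed to come from standard on-policy rollouts.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Gaussian policy for continuous actions (illustrative)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

class ValueNet(nn.Module):
    """State-value function approximator (illustrative)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                               nn.Linear(64, 1))

    def forward(self, obs):
        return self.v(obs).squeeze(-1)

def vpg_update(policy, value_fn, pi_opt, v_opt, batch, value_steps=80):
    """One iteration: a single vanilla policy-gradient step, followed by
    many value-regression steps (the knob the paper argues is critical)."""
    obs, acts = batch["obs"], batch["acts"]
    returns, advs = batch["returns"], batch["advs"]

    # Vanilla policy-gradient step: no clipping, no trust region.
    logp = policy.dist(obs).log_prob(acts).sum(-1)
    pi_loss = -(logp * advs).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Increasing `value_steps` per iteration is the simple change that the
    # paper reports closes the gap with PPO on continuous control benchmarks.
    for _ in range(value_steps):
        v_loss = ((value_fn(obs) - returns) ** 2).mean()
        v_opt.zero_grad()
        v_loss.backward()
        v_opt.step()
```

In a full training loop, each iteration would collect fresh on-policy trajectories, compute returns and advantages (e.g., with GAE), and then call an update of this form; the sketch only shows the per-iteration optimization step.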
Lay Summary: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is value estimation. To demonstrate this, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. We also show that vanilla policy gradient is significantly more robust to hyperparameter choices, and thus has the potential to serve as a simple and effective alternative to PPO in various robot learning tasks.
Link To Code: https://github.com/taowang0/value-estimation-vpg
Primary Area: Reinforcement Learning->Deep RL
Keywords: policy gradient methods, value estimation, trust region methods, deep RL
Submission Number: 2698