Revisiting Value Estimation in Policy Gradient Methods

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: policy gradient, temporal-difference, continuous control
TL;DR: We introduce a well-posedness framework for temporal-difference estimation in policy gradient methods, showing that with proper formulation and bootstrapping, even simple policy gradient methods can solve various control problems.
Abstract: Temporal-difference (TD) estimation is a central component of value estimation in reinforcement learning, yet its role within policy gradient methods has not been systematically studied. In this work, we introduce a framework grounded in the notion of well-posedness, which provides a rigorous formulation of TD estimation across a wide range of control problems and enables more accurate value estimation. Through extensive empirical studies, we further show that when policy optimization is properly formulated and combined with an appropriate bootstrapping strategy, even the vanilla policy gradient algorithm can reliably solve these control problems. These findings indicate that deep reinforcement learning methods can be made more robust and interpretable through careful problem formulation.
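To make the abstract's central combination concrete, here is a minimal illustrative sketch of TD(0) bootstrapped value estimation driving a vanilla (softmax) policy gradient update on a toy chain MDP. This is an assumption-laden sketch, not the paper's algorithm: the environment, step sizes, and all names (`step`, `policy`, `V`, `theta`) are invented for illustration.

```python
# Minimal sketch (NOT the paper's method): TD(0) bootstrapping used as the
# value estimator inside a vanilla policy gradient loop, on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.99

# Illustrative deterministic chain: action 1 advances toward the goal state
# (reward 1, episode ends); action 0 stays put with reward 0.
def step(s, a):
    if a == 1:
        s2 = s + 1
        done = s2 == n_states - 1
        return s2, float(done), done
    return s, 0.0, False

theta = np.zeros((n_states, n_actions))  # softmax policy logits
V = np.zeros(n_states)                   # bootstrapped value estimates
alpha_pi, alpha_v = 0.1, 0.1             # assumed step sizes

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for episode in range(2000):
    s, t = 0, 0
    while t < 50:
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s2, r, done = step(s, a)
        # TD(0) target bootstraps from V(s') instead of the Monte Carlo return.
        target = r + (0.0 if done else gamma * V[s2])
        td_error = target - V[s]
        V[s] += alpha_v * td_error
        # Vanilla policy gradient step; the TD error serves as the advantage,
        # and onehot(a) - p is the softmax score function for the logits.
        grad_logp = -p
        grad_logp[a] += 1.0
        theta[s] += alpha_pi * td_error * grad_logp
        if done:
            break
        s, t = s2, t + 1

print("Learned values:", np.round(V, 2))
print("P(advance) per state:", [round(policy(s)[1], 2) for s in range(n_states)])
```

The design point the sketch isolates is the one the abstract emphasizes: the policy update itself is plain REINFORCE-style ascent, and the only sophistication lies in how the value target is formulated (here, a one-step bootstrapped TD target rather than a full return).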
Primary Area: reinforcement learning
Submission Number: 4910