Revisiting Value Estimation in Policy Gradient Methods

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: policy gradient, temporal-difference, continuous control
TL;DR: We introduce a well-posedness framework for temporal-difference estimation in policy gradient methods, showing that with proper formulation and bootstrapping, even simple policy gradient methods can solve various control problems.
Abstract: Temporal-difference (TD) estimation is a central component of value estimation in reinforcement learning, yet its role within policy gradient methods has not been systematically studied. In this work, we introduce a framework grounded in the notion of well-posedness, which provides a rigorous formulation of TD estimation across a wide range of control problems and enables more accurate value estimation. Through extensive empirical studies, we further show that when policy optimization is properly formulated and combined with an appropriate bootstrapping strategy, even the vanilla policy gradient algorithm can reliably solve these control problems. These findings indicate that deep reinforcement learning methods can be made more robust and interpretable through careful problem formulation.
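To make the abstract's central combination concrete, here is a minimal illustrative sketch of TD(0) bootstrapped value estimation driving a vanilla (softmax) policy gradient update on a toy chain MDP. This is an assumption-laden sketch, not the paper's algorithm: the environment, step sizes, and all names (`step`, `policy`, `V`, `theta`) are invented for illustration.

```python
# Minimal sketch (NOT the paper's method): TD(0) bootstrapping used as the
# value estimator inside a vanilla policy gradient loop, on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.99

# Illustrative deterministic chain: action 1 advances toward the goal state
# (reward 1, episode ends); action 0 stays put with reward 0.
def step(s, a):
    if a == 1:
        s2 = s + 1
        done = s2 == n_states - 1
        return s2, float(done), done
    return s, 0.0, False

theta = np.zeros((n_states, n_actions))  # softmax policy logits
V = np.zeros(n_states)                   # bootstrapped value estimates
alpha_pi, alpha_v = 0.1, 0.1             # assumed step sizes

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for episode in range(2000):
    s, t = 0, 0
    while t < 50:
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s2, r, done = step(s, a)
        # TD(0) target bootstraps from V(s') instead of the Monte Carlo return.
        target = r + (0.0 if done else gamma * V[s2])
        td_error = target - V[s]
        V[s] += alpha_v * td_error
        # Vanilla policy gradient step; the TD error serves as the advantage,
        # and onehot(a) - p is the softmax score function for the logits.
        grad_logp = -p
        grad_logp[a] += 1.0
        theta[s] += alpha_pi * td_error * grad_logp
        if done:
            break
        s, t = s2, t + 1

print("Learned values:", np.round(V, 2))
print("P(advance) per state:", [round(policy(s)[1], 2) for s in range(n_states)])
```

The design point the sketch isolates is the one the abstract emphasizes: the policy update itself is plain REINFORCE-style ascent, and the only sophistication lies in how the value target is formulated (here, a one-step bootstrapped TD target rather than a full return).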
Primary Area: reinforcement learning
Submission Number: 4910