Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

Published: 28 Nov 2025 · Last Modified: 30 Nov 2025 · NeurIPS 2025 Workshop MLxOR · CC BY 4.0
Keywords: Importance sampling, reinforcement learning, policy gradient
Abstract: We study trajectory reuse in natural policy gradient methods. Classical policy gradient algorithms require large amounts of fresh data, which limits their sample efficiency. We propose RNPG, a reuse-based natural policy gradient algorithm that incorporates past trajectories through importance weighting of both the gradient and the Fisher information matrix estimators. We establish asymptotic convergence and a weak convergence rate for RNPG, showing that reuse improves efficiency without altering the limiting behavior. Experiments on the CartPole benchmark demonstrate that RNPG achieves faster convergence and smoother performance than vanilla policy gradient (VPG) and vanilla natural policy gradient (VNPG), with additional gains from larger reuse sizes. Our results highlight the theoretical and empirical benefits of reusing trajectories in policy optimization.
Submission Number: 12
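To make the mechanism in the abstract concrete, here is a minimal LaTeX sketch of importance-weighted natural policy gradient updates with reused trajectories. The notation is assumed, not taken from the paper: B is the reuse window size, n the number of trajectories per iteration, D_i the batch collected under past iterate theta_i, R(tau) the trajectory return, and alpha_k the step size; the paper's exact estimators and weighting scheme may differ.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Trajectory likelihood ratio: transition dynamics cancel, leaving a product of policy ratios.
\begin{align*}
w_{k,i}(\tau) &= \frac{p_{\theta_k}(\tau)}{p_{\theta_i}(\tau)}
  = \prod_{t=0}^{T-1} \frac{\pi_{\theta_k}(a_t \mid s_t)}{\pi_{\theta_i}(a_t \mid s_t)}, \\
% Importance-weighted policy gradient estimate over the reuse window.
\widehat{g}_k &= \frac{1}{Bn} \sum_{i=k-B+1}^{k} \sum_{\tau \in \mathcal{D}_i}
  w_{k,i}(\tau)\, R(\tau)\, \nabla_\theta \log p_{\theta_k}(\tau), \\
% Importance-weighted Fisher information estimate from the same reused data.
\widehat{F}_k &= \frac{1}{Bn} \sum_{i=k-B+1}^{k} \sum_{\tau \in \mathcal{D}_i}
  w_{k,i}(\tau)\, \nabla_\theta \log p_{\theta_k}(\tau)\, \nabla_\theta \log p_{\theta_k}(\tau)^{\top}, \\
% Natural policy gradient step preconditioned by the estimated Fisher matrix.
\theta_{k+1} &= \theta_k + \alpha_k\, \widehat{F}_k^{-1}\, \widehat{g}_k .
\end{align*}
\end{document}

With B = 1 the reuse window contains only the current batch and the update reduces to a standard sample-based natural policy gradient step, which is consistent with the abstract's claim that reuse does not alter the limiting behavior.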