Abstract: Natural gradient methods are appealing in policy optimization due to their invariance to smooth reparameterization and their ability to account for the local geometry of the policy manifold. These properties often lead to improved conditioning of the optimization problem compared to Euclidean policy gradients. However, their reliance on Monte Carlo estimation introduces high variance and sensitivity to hyperparameters. In this paper, we address these limitations by integrating Randomized Quasi-Monte Carlo (RQMC) sampling into the natural actor-critic (NAC) framework. We revisit the NAC linear system and show that, under imperfect value approximation, the NAC update decomposes exactly into the true natural gradient plus a Fisher-metric projection of the Bellman residual onto the score-feature span. We further develop RQMC-based NAC estimators that replace IID sampling with randomized low-discrepancy trajectories. We provide a variance analysis showing that, under mild regularity conditions, these RQMC-based estimators strictly reduce variance relative to their IID counterparts, thereby limiting the propagation of Bellman-residual error into the natural-gradient update. Empirical results on a range of reinforcement learning benchmarks demonstrate that our RQMC-enhanced algorithms consistently match or improve upon the performance and stability of their vanilla counterparts.
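To illustrate the core idea of replacing IID sampling with randomized low-discrepancy points, here is a minimal sketch (not the paper's NAC algorithm) that compares the variance of a score-function gradient estimator under IID Gaussian noise versus randomized quasi-Monte Carlo noise obtained from scrambled Sobol points via the inverse Gaussian CDF. The one-step Gaussian-policy bandit, the reward function, and all constants are illustrative assumptions.

```python
# Minimal sketch: RQMC (scrambled Sobol) noise vs. IID noise in a
# score-function gradient estimator for a toy one-step Gaussian-policy bandit.
# All problem details (reward, theta, sigma, sample sizes) are illustrative.
import numpy as np
from scipy.stats import norm, qmc

def reward(a):
    # Toy deterministic reward for a continuous-action bandit.
    return -(a - 1.0) ** 2

def score_grad(theta, noise, sigma=0.5):
    # REINFORCE estimate for a Gaussian policy N(theta, sigma^2):
    # grad_theta log pi(a) = (a - theta) / sigma^2.
    actions = theta + sigma * noise
    return np.mean(reward(actions) * (actions - theta) / sigma**2)

theta, n, reps = 0.0, 256, 50  # n is a power of 2 to respect Sobol balance
iid_grads, rqmc_grads = [], []
for seed in range(reps):
    rng = np.random.default_rng(seed)
    iid_noise = rng.standard_normal(n)                 # IID baseline
    sobol = qmc.Sobol(d=1, scramble=True, seed=seed)   # randomized low-discrepancy points in [0,1)
    rqmc_noise = norm.ppf(sobol.random(n)).ravel()     # map to Gaussian noise
    iid_grads.append(score_grad(theta, iid_noise))
    rqmc_grads.append(score_grad(theta, rqmc_noise))

print("IID  estimator variance:", np.var(iid_grads))
print("RQMC estimator variance:", np.var(rqmc_grads))
```

Running this typically shows a markedly smaller variance for the RQMC estimator, which is the effect the paper's variance analysis formalizes for the full NAC setting.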
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=7OzWSoA3au
Changes Since Last Submission: The previous submission used the wrong format; this has now been resolved.
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 7065