Recurrent Natural Policy Gradient for POMDPs

TMLR Paper 5091 Authors

12 Jun 2025 (modified: 17 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Solving partially observable Markov decision processes (POMDPs) is a long-standing challenge in reinforcement learning (RL) due to the inherent curse of dimensionality that arises from the non-stationarity of optimal policies. In this paper, we address this challenge by integrating recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a multi-step temporal difference (TD) method within a natural actor-critic (NAC) framework, chosen for its computational efficiency. We establish non-asymptotic theoretical guarantees for this method, demonstrating its effectiveness for solving POMDPs and identifying the pathological cases that stem from long-term dependencies. By integrating RNNs into the NAC framework with theoretical guarantees, this work advances the theoretical foundations of RL for POMDPs and provides a scalable framework for solving complex decision-making problems.
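To make the ingredients named in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' algorithm) of a recurrent actor-critic in PyTorch: a GRU-based policy and critic that condition on observation histories, multi-step TD targets for the critic, and a crude diagonal preconditioner standing in for the natural policy gradient step. All class names, hyperparameters, and the diagonal Fisher-style approximation are assumptions made for illustration only.

```python
# Illustrative sketch only: this is NOT the paper's method. It shows the pieces
# the abstract names (RNN policy, multi-step TD critic, natural-gradient-style
# update) under simplifying assumptions (discrete actions, toy dimensions).

import torch
import torch.nn as nn


class RecurrentPolicy(nn.Module):
    """GRU policy: maps an observation sequence (history) to action logits."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_seq, h=None):
        out, h = self.gru(obs_seq, h)          # out: (B, T, hidden_dim)
        return self.head(out), h               # logits: (B, T, act_dim)


class RecurrentCritic(nn.Module):
    """GRU critic: estimates V(history) for multi-step TD targets."""

    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq, h=None):
        out, h = self.gru(obs_seq, h)
        return self.head(out).squeeze(-1), h   # values: (B, T)


def n_step_td_targets(rewards, values, gamma=0.99, n=5):
    """Multi-step TD targets: G_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V_{t+n}."""
    B, T = rewards.shape
    targets = torch.zeros_like(rewards)
    for t in range(T):
        g, discount = torch.zeros(B), 1.0
        for k in range(n):
            if t + k >= T:
                break
            g = g + discount * rewards[:, t + k]
            discount *= gamma
        if t + n < T:
            g = g + discount * values[:, t + n]  # bootstrap from the critic
        targets[:, t] = g
    return targets


def natural_gradient_step(policy, loss, lr=0.05, damping=1e-3):
    """Crude NPG-style step: precondition gradients with a diagonal,
    empirical-Fisher-like estimate (squared gradients plus damping)."""
    grads = torch.autograd.grad(loss, list(policy.parameters()))
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            fisher_diag = g.pow(2) + damping
            p -= lr * g / fisher_diag
```

In the actual method, the preconditioner would be the policy's Fisher information over histories rather than this diagonal surrogate, and the critic would be fit to the multi-step targets above; the sketch is only meant to show how the three components fit together.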
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Martha_White1
Submission Number: 5091