Recurrent Natural Policy Gradient for POMDPs

TMLR Paper 5091 Authors

12 Jun 2025 (modified: 17 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Solving partially observable Markov decision processes (POMDPs) is a long-standing challenge in reinforcement learning (RL) due to the inherent curse of dimensionality that arises from the non-stationarity of optimal policies. In this paper, we address this challenge by integrating recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a multi-step temporal difference (TD) method within a natural actor-critic (NAC) framework, chosen for its computational efficiency. We establish non-asymptotic theoretical guarantees for this method, demonstrating its effectiveness for solving POMDPs and identifying the pathological cases that stem from long-term dependencies. By integrating RNNs into the NAC framework with theoretical guarantees, this work advances the theoretical foundations of RL for POMDPs and provides a scalable framework for solving complex decision-making problems.
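To make the ingredients named in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' algorithm) of a recurrent actor-critic in PyTorch: a GRU-based policy and critic that condition on observation histories, multi-step TD targets for the critic, and a crude diagonal preconditioner standing in for the natural policy gradient step. All class names, hyperparameters, and the diagonal Fisher-style approximation are assumptions made for illustration only.

```python
# Illustrative sketch only: this is NOT the paper's method. It shows the pieces
# the abstract names (RNN policy, multi-step TD critic, natural-gradient-style
# update) under simplifying assumptions (discrete actions, toy dimensions).

import torch
import torch.nn as nn


class RecurrentPolicy(nn.Module):
    """GRU policy: maps an observation sequence (history) to action logits."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_seq, h=None):
        out, h = self.gru(obs_seq, h)          # out: (B, T, hidden_dim)
        return self.head(out), h               # logits: (B, T, act_dim)


class RecurrentCritic(nn.Module):
    """GRU critic: estimates V(history) for multi-step TD targets."""

    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq, h=None):
        out, h = self.gru(obs_seq, h)
        return self.head(out).squeeze(-1), h   # values: (B, T)


def n_step_td_targets(rewards, values, gamma=0.99, n=5):
    """Multi-step TD targets: G_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V_{t+n}."""
    B, T = rewards.shape
    targets = torch.zeros_like(rewards)
    for t in range(T):
        g, discount = torch.zeros(B), 1.0
        for k in range(n):
            if t + k >= T:
                break
            g = g + discount * rewards[:, t + k]
            discount *= gamma
        if t + n < T:
            g = g + discount * values[:, t + n]  # bootstrap from the critic
        targets[:, t] = g
    return targets


def natural_gradient_step(policy, loss, lr=0.05, damping=1e-3):
    """Crude NPG-style step: precondition gradients with a diagonal,
    empirical-Fisher-like estimate (squared gradients plus damping)."""
    grads = torch.autograd.grad(loss, list(policy.parameters()))
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            fisher_diag = g.pow(2) + damping
            p -= lr * g / fisher_diag
```

In the actual method, the preconditioner would be the policy's Fisher information over histories rather than this diagonal surrogate, and the critic would be fit to the multi-step targets above; the sketch is only meant to show how the three components fit together.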
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Martha_White1
Submission Number: 5091