Keywords: Reinforcement Learning, Recurrent Networks, Stateful Policies, Imitation Learning, Stochastic Stateful Policy Gradient
TL;DR: Novel stochastic policy gradient approximator for stateful policies, such as RNNs.
Abstract: Stateful policies play an important role in reinforcement learning, for example in handling
partially observable environments, enhancing robustness, or encoding an inductive
bias directly into the policy structure. The conventional method for training stateful
policies is Backpropagation Through Time (BPTT), which comes with significant
drawbacks, such as slow training due to sequential gradient propagation and the
occurrence of vanishing or exploding gradients. The gradient is often truncated
to address these issues, resulting in a biased policy update. We present a novel
approach for training stateful policies by decomposing such a policy into a stochastic
internal state kernel and a stateless policy, jointly optimized by following the
stateful policy gradient. We introduce different versions of the stateful policy
gradient theorem, enabling us to easily instantiate stateful variants of popular
reinforcement learning and imitation learning algorithms. Furthermore, we provide
a theoretical analysis of our new gradient estimator and compare it with BPTT.
We evaluate our approach on complex continuous control tasks, e.g., humanoid
locomotion, and demonstrate that our gradient estimator scales effectively with
task complexity while offering a faster and simpler alternative to BPTT.
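
To illustrate the decomposition described in the abstract, here is a minimal sketch (in PyTorch; the class, method, and parameter names are hypothetical, not taken from the paper's code) of a stateful policy split into a stochastic internal-state kernel p(z' | s, z) and a stateless action policy pi(a | s, z). Because the internal state is sampled rather than carried as a differentiable hidden state, each step contributes an ordinary log-likelihood term and no backpropagation through time is needed; this is a REINFORCE-style reading of the idea, not the authors' exact estimator.

```python
import torch
import torch.nn as nn

class StatefulPolicy(nn.Module):
    """Hypothetical sketch: stateful policy = stochastic state kernel + stateless policy."""

    def __init__(self, obs_dim, act_dim, latent_dim, hidden=64):
        super().__init__()
        # stochastic internal-state kernel p(z' | s, z)
        self.state_kernel = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-std of z'
        )
        # stateless action policy pi(a | s, z)
        self.action_head = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * act_dim),      # mean and log-std of a
        )

    def step(self, s, z):
        """Sample (a, z') and return the joint log-probability of one step."""
        x = torch.cat([s, z], dim=-1)

        z_mu, z_logstd = self.state_kernel(x).chunk(2, dim=-1)
        z_dist = torch.distributions.Normal(z_mu, z_logstd.exp())
        # sample() is non-differentiable, so no gradient flows through time
        z_next = z_dist.sample()

        a_mu, a_logstd = self.action_head(x).chunk(2, dim=-1)
        a_dist = torch.distributions.Normal(a_mu, a_logstd.exp())
        a = a_dist.sample()

        # per-step log-likelihood term for a score-function (policy gradient) update
        log_prob = z_dist.log_prob(z_next).sum(-1) + a_dist.log_prob(a).sum(-1)
        return a, z_next, log_prob
```

In a rollout, each step's log_prob would be weighted by a return or advantage estimate and summed into the loss; since z_next is sampled, the per-step gradients do not chain back through earlier timesteps, avoiding the sequential propagation and vanishing/exploding gradients of BPTT.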
Already Accepted Paper At Another Venue: already accepted somewhere else
Submission Number: 27