TL;DR: We extend the policy gradient formulas to random time horizons, for both stochastic and deterministic policies.
Abstract: We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite deterministic or infinite trajectory runtimes, we argue that many real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since these stopping times typically depend on the policy, their randomness affects the policy gradient formulas, which we derive rigorously in this work (in most cases for the first time) for both stochastic and deterministic policies. We present two complementary perspectives, trajectory-based and state-space-based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
Lay Summary: In this work, we improve reinforcement learning in situations where it is unclear how long the task at hand should last. Previous work assumes either a fixed amount of time or a task that goes on forever. But in real life, tasks often end at random times - for example, a game might end early if the player loses, or a robot might stop working if its battery runs out. We show that these random endings affect how the learning process should work, especially when it comes to adjusting the system to improve over time. We work out precisely how the learning process (using so-called policy gradients) must change when tasks end randomly, for two types of learning systems: those that make decisions randomly and those that make deterministic decisions. Our experiments show that this helps the learning process work faster and better than previous methods.
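To make the setting concrete, here is a minimal sketch of the kind of problem the paper addresses: a Monte Carlo (REINFORCE-style) policy gradient estimate over episodes whose length is a random, trajectory-dependent stopping time. The environment (a 1D random walk stopped at a boundary), the per-step reward, and the Bernoulli policy parameterization are illustrative assumptions for this sketch, not the paper's actual formulas or experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_episode(theta, rng, boundary=5, max_steps=200):
    """Roll out a 1D random walk that stops when |state| reaches the boundary.

    The stopping time T is random and depends on the policy parameter theta:
    this coupling is exactly what complicates the policy gradient.
    """
    state, score_sum, t = 0, 0.0, 0
    p = sigmoid(theta)  # probability of stepping +1 (illustrative policy)
    while abs(state) < boundary and t < max_steps:
        a = 1 if rng.random() < p else -1
        # score function d/dtheta log pi(a): (1 - p) for a = +1, (-p) for a = -1
        score_sum += (1.0 - p) if a == 1 else -p
        state += a
        t += 1
    # reward of -1 per step, so the return is -T (shorter episodes are better)
    return -t, score_sum, t

def reinforce_gradient(theta, n_episodes=500, seed=0):
    """Naive Monte Carlo gradient estimate over episodes of random length."""
    rng = np.random.default_rng(seed)
    grads, times = [], []
    for _ in range(n_episodes):
        ret, score_sum, t = run_episode(theta, rng)
        grads.append(ret * score_sum)
        times.append(t)
    return float(np.mean(grads)), times
```

Note that each episode contributes a sum of score terms of random length T; the paper's contribution is a rigorous treatment of how this policy-dependent randomness enters the gradient formulas, which this naive estimator does not capture.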
Link To Code: https://github.com/riberaborrell/rl-random-times
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: reinforcement learning, policy gradient theorem, random stopping times, optimal control
Submission Number: 10027