Keywords: Statistical physics of learning, Generalization models, Reinforcement learning, REINFORCE policy gradient
Abstract: Reinforcement learning (RL) algorithms have proven transformative in a range of
domains. To tackle real-world domains, these systems often use neural networks
to learn policies directly from pixels or other high-dimensional sensory input. By
contrast, much of the theory of RL has focused on discrete state spaces or worst-case
analyses, and fundamental questions remain about the dynamics of policy learning
in high-dimensional settings. Here we propose a simple high-dimensional model
of RL and derive its typical dynamics as a set of closed-form ODEs. We show that
the model exhibits rich behavior including delayed learning under sparse rewards;
a speed-accuracy trade-off depending on reward stringency; and a dependence
of the learning regime on reward baselines. These results offer a first step toward
understanding policy gradient methods in high-dimensional settings.
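To make the object of study concrete, below is a minimal sketch (not the paper's model or code) of the REINFORCE policy gradient with a reward baseline, the estimator family the abstract refers to, applied to a hypothetical high-dimensional teacher-student task with sparse 0/1 rewards. The dimension, learning rate, teacher direction, and sigmoid policy are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative sketch: REINFORCE with a reward baseline on a hypothetical
# high-dimensional teacher-student task. A linear "student" policy emits a
# binary action on Gaussian inputs; reward is 1 when the action agrees with
# a fixed "teacher" direction. All names and constants here are assumptions.

rng = np.random.default_rng(0)
D = 500                                            # input dimension (high-dimensional regime)
w_teacher = rng.standard_normal(D) / np.sqrt(D)    # fixed target direction
w = np.zeros(D)                                    # student policy weights
lr, baseline = 0.5, 0.0                            # learning rate and reward baseline

def policy_prob(x, w):
    """Probability of action a=1 under a sigmoid (Bernoulli-logistic) policy."""
    return 1.0 / (1.0 + np.exp(-x @ w))

for step in range(10_000):
    x = rng.standard_normal(D)                     # fresh high-dimensional input
    p = policy_prob(x, w)
    a = rng.random() < p                           # sample binary action
    r = float(a == (x @ w_teacher > 0))            # sparse 0/1 reward from the teacher
    # REINFORCE update: grad_w log pi(a|x) = (a - p) * x for this policy,
    # scaled by the baseline-shifted reward (r - baseline).
    w += (lr / D) * (r - baseline) * (float(a) - p) * x
```

In this sketch, varying `baseline` changes the sign and scale of updates on rewarded versus unrewarded actions, which is the kind of baseline dependence of the learning regime that the abstract describes.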