Abstract: We study reinforcement learning with access to state observations from a demonstrator in addition to a reward signal. In this setting, the demonstrator supplies only sequences of observations, and we leverage these samples to improve the learning efficiency of the agent. Our key insight is that in most environments expert policies visit only a tiny fraction of the available states. We develop a simple technique, e-stops, to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of exploration required, while retaining a performance bound that trades off the rate of convergence against a small asymptotic suboptimality gap. We analyze the regret behavior of e-stops and present empirical results demonstrating that our reset mechanism provides order-of-magnitude speedups over classic reinforcement learning methods.
CMT Num: 8723
Code Link: https://github.com/samuela/e-stops
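The linked repository contains the authors' implementation. As a rough illustration of the mechanism described in the abstract, the sketch below shows one way an e-stop could work: the episode is terminated early whenever the agent visits a state outside the support of the demonstrator's trajectories. This is a minimal sketch under assumptions of discrete, hashable observations and a simplified environment interface; the names (`collect_support`, `run_episode_with_estop`, the `env`/`policy` interface) are hypothetical and not taken from the repository or the paper.

```python
# Hypothetical sketch of the e-stop idea: end an episode early whenever the
# agent leaves the set of states visited by the demonstrator, so exploration
# stays near the demonstrator's state distribution.

def collect_support(demo_trajectories):
    """Build the support set from demonstrator observation sequences
    (assumes discrete, hashable observations)."""
    support = set()
    for trajectory in demo_trajectories:
        support.update(trajectory)
    return support


def run_episode_with_estop(env, policy, support, max_steps=1000):
    """Roll out `policy` in `env`, triggering an e-stop (early termination)
    as soon as the agent visits a state outside the demonstrator's support.
    Uses a simplified step signature: env.step(action) -> (obs, reward, done)."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        if obs not in support:
            # e-stop: stop the episode rather than spending samples
            # exploring far outside the demonstrator's visited states.
            break
        if done:
            break
    return total_reward
```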