Accidental exploration through value predictors


Sep 27, 2018 (modified: Oct 10, 2018) ICLR 2019 Conference Blind Submission readers: everyone
  • Abstract: Infinite trajectory length is an almost universal assumption in the theoretical foundations of reinforcement learning. In practice, of course, trajectories are always finite. In this paper we examine one specific consequence of this disparity. We focus on the case where the finiteness of trajectories also causes the underlying process to lose the Markov property. As a result, standard state value estimators become biased, which in turn manifests as vastly different learning dynamics in algorithms that use value predictors. We investigate these claims theoretically for a one-dimensional random walk and a Wiener process, and empirically on a number of simple environments. As a representative algorithm that uses a value predictor we take GAE, and we compare it to a plain policy gradient.
  • TL;DR: We study the biases introduced in common value predictors by the fact that trajectories are, in practice, finite.
  • Keywords: reinforcement learning, value predictors, exploration
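
The loss of the Markov property mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's setup: the reward structure (return 1 if a symmetric ±1 random walk reaches a hypothetical `goal` position before the time limit, 0 otherwise) and the function name `value` are assumptions made for illustration only. It computes the exact expected return by recursion and shows that the same position has different values depending on how many steps remain, so a predictor that sees only the position cannot be unbiased for every time step.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def value(x: int, steps_left: int, goal: int = 3) -> float:
    """Exact expected return of a symmetric +/-1 random walk at position x.

    Assumed toy reward: 1 if the walk reaches `goal` within `steps_left`
    steps, 0 if the episode is cut off first.
    """
    if x >= goal:
        return 1.0  # goal reached, episode ends with return 1
    if steps_left == 0:
        return 0.0  # time limit hit before reaching the goal
    # One step left or right with equal probability.
    return 0.5 * value(x + 1, steps_left - 1, goal) + 0.5 * value(
        x - 1, steps_left - 1, goal
    )


if __name__ == "__main__":
    # Same position x = 0, different remaining horizons.
    for steps_left in (2, 3, 5, 20):
        print(f"x=0, steps_left={steps_left}: value={value(0, steps_left):.5f}")
```

With an infinite horizon the walk in this toy example would reach the goal with probability 1, so the value of every position would be 1. Under a finite time limit the value depends on the remaining time as well as the position, which is the kind of disparity the abstract describes.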