Accidental exploration through value predictors

Tomasz Kisielewski, Damian Leśniak, Maia Pasek

Sep 27, 2018 ICLR 2019 Conference Blind Submission
  • Abstract: Infinite trajectory length is an almost universal assumption in the theoretical foundations of reinforcement learning. In practice, learning occurs on finite trajectories. In this paper we examine a specific consequence of this disparity, namely a strong bias of the time-bounded every-visit Monte Carlo value estimator. This bias manifests as a vastly different learning dynamic for algorithms that use value predictors, and it can either encourage or discourage exploration. We investigate these claims theoretically for a one-dimensional random walk and empirically on a number of simple environments. We use GAE as an algorithm involving a value predictor and evolution strategies as a reference point.
  • TL;DR: We study the biases introduced in common value predictors by the fact that trajectories are, in practice, finite.
  • Keywords: reinforcement learning, value predictors, exploration
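
The bias described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the symmetric walk starting at the origin, the reward of 1 per step, the discount of 0.95, the horizon of 50 steps, and the function name `truncated_every_visit_mc` are all assumptions chosen for illustration. Under these assumptions the infinite-horizon value is 1 / (1 - gamma) in every state, so any spread in the estimates comes from the time bound alone.

```python
import random
from collections import defaultdict


def truncated_every_visit_mc(num_episodes=2000, horizon=50, gamma=0.95, seed=0):
    """Every-visit Monte Carlo value estimates for a symmetric 1D random walk.

    Illustrative assumptions (not taken from the paper): the walk starts at 0,
    every step yields reward 1, and each trajectory is cut off after
    `horizon` steps.
    """
    rng = random.Random(seed)
    return_sum = defaultdict(float)
    visit_count = defaultdict(int)

    for _ in range(num_episodes):
        # Roll out one time-bounded trajectory and record the visited states.
        state = 0
        visited = []
        for _ in range(horizon):
            visited.append(state)
            state += rng.choice((-1, 1))

        # Accumulate discounted returns backwards. Rewards past the horizon
        # are implicitly treated as 0, which is the source of the bias.
        g = 0.0
        for s in reversed(visited):
            g = 1.0 + gamma * g
            return_sum[s] += g
            visit_count[s] += 1

    # Every-visit estimate: average return over all visits to each state.
    return {s: return_sum[s] / visit_count[s] for s in return_sum}


if __name__ == "__main__":
    gamma = 0.95
    estimates = truncated_every_visit_mc(gamma=gamma)
    true_value = 1.0 / (1.0 - gamma)  # identical for every state
    for s in sorted(estimates):
        if s % 5 == 0:
            print(f"state {s:+d}: estimate {estimates[s]:.2f}, true {true_value:.2f}")
```

In this sketch every estimate falls below the true value, and states far from the origin are underestimated more, because they can only be reached late in a trajectory, when few rewards remain before the cutoff. An agent trained against such a predictor would see distant states as less valuable than they are; with negative per-step rewards the sign of the effect flips, which is one way a time bound could encourage or discourage exploration.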