TD Learning with Neural Networks - Study of the Leakage Propagation Problem

Hugo Penedones, Damien Vincent, Timothy Mann, Sylvain Gelly

Feb 12, 2018, ICLR 2018 Workshop Submission
  • Abstract: In on-policy evaluation, one estimates the value function of the data-generating policy with algorithms such as Monte-Carlo regression (MC) or Temporal-Difference learning (TD). We investigate the poor value estimates that arise when using a function approximator such as a neural network, due to limited data, limited capacity, or the training process itself, and show how these approximation errors can be further propagated by TD bootstrap updates. We suggest that this problem may be mitigated by first learning, in an unsupervised way, a representation that separates states that look similar but are actually far apart along the trajectories followed by the policy.
  • TL;DR: TD Learning with neural networks has leakage problems that may be partially mitigated by unsupervised learning
  • Keywords: policy evaluation, temporal difference learning, unsupervised learning, neural networks, machine learning
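The leakage phenomenon the abstract describes can be illustrated in a few lines. The following is a minimal sketch (not the authors' code): TD(0) versus Monte-Carlo policy evaluation on a tiny deterministic Markov reward process, with a linear approximator standing in for the neural network. The MRP, feature map, and all hyperparameters are illustrative assumptions. Two states are aliased (identical features), so the approximator must average their values; under TD, that averaging error is then propagated backwards to an earlier state through the bootstrap target, while MC, which regresses directly on observed returns, is unaffected.

```python
# Sketch of the "leakage" problem: TD(0) vs Monte-Carlo evaluation on a
# tiny deterministic MRP with a linear function approximator.
#
# MRP (all transitions deterministic, gamma = 1):
#   A --r=0--> B --r=1--> terminal
#   C --r=0--> terminal
# True values: V(A) = 1, V(B) = 1, V(C) = 0.

GAMMA = 1.0
# (reward, next_state); next_state None means the episode terminates.
TRANSITIONS = {"A": (0.0, "B"), "B": (1.0, None), "C": (0.0, None)}
# B and C share the same features: they "look similar" to the
# approximator even though their true values differ (1 vs 0).
FEATURES = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}
TRUE_RETURNS = {"A": 1.0, "B": 1.0, "C": 0.0}

def value(w, s):
    """Linear value estimate v(s) = w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, FEATURES[s]))

def sgd_step(w, s, target, alpha):
    """One gradient step on the squared error toward the target."""
    err = target - value(w, s)
    return [wi + alpha * err * xi for wi, xi in zip(w, FEATURES[s])]

def evaluate(method, sweeps=3000, alpha=0.05):
    w = [0.0, 0.0]
    for _ in range(sweeps):
        for s in ("A", "B", "C"):
            if method == "mc":   # regress on the observed return
                target = TRUE_RETURNS[s]
            else:                # TD(0): bootstrap on the next estimate
                r, nxt = TRANSITIONS[s]
                target = r + (GAMMA * value(w, nxt) if nxt else 0.0)
            w = sgd_step(w, s, target, alpha)
    return w

mc_w = evaluate("mc")
td_w = evaluate("td")
# MC recovers V(A) = 1 despite the aliasing of B and C, while TD's
# estimate of V(A) is dragged toward the averaged B/C value (~0.5):
# the approximation error at B leaks backward via the bootstrap target.
print("MC V(A) =", round(value(mc_w, "A"), 2))
print("TD V(A) =", round(value(td_w, "A"), 2))
```

Both methods misestimate the aliased states B and C, since the shared features force a single value for both; the difference is that only TD lets that error contaminate state A, whose own features are perfectly distinguishing. A representation that separated B from C (as the abstract proposes to learn unsupervisedly) would remove the leak at its source.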