- Abstract: In on-policy evaluation, one estimates the value function of the data-generating policy with algorithms such as Monte-Carlo regression (MC) or Temporal-Difference learning (TD). We investigate the poor estimates that arise when using a function approximator such as a neural network, whether due to limited data, limited capacity, or the training process, and how these approximation errors can be further propagated by TD bootstrap updates. We suggest that this problem may be mitigated by first learning, in an unsupervised manner, a representation that separates states that look similar but are actually far apart along the trajectories followed by the policy.
- Keywords: policy evaluation, temporal difference learning, unsupervised learning, neural networks, machine learning
- TL;DR: TD learning with neural networks suffers from leakage problems that may be partially mitigated by unsupervised representation learning
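
To make the contrast in the abstract concrete, the following is a minimal sketch (not the paper's method) of how MC and TD(0) regression targets differ on a single three-step episode. The function names `mc_returns` and `td0_targets`, the rewards, the discount factor, and the deliberately wrong value estimate for the second state are all illustrative assumptions; no function approximator is involved.

```python
def mc_returns(rewards, gamma):
    """Discounted return G_t for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td0_targets(rewards, next_values, gamma):
    """One-step bootstrapped targets r_t + gamma * V(s_{t+1})."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical episode s_0 -> s_1 -> s_2 -> terminal, reward only at the end.
rewards = [0.0, 0.0, 1.0]
gamma = 0.5

# Assumed current estimates of V at the successor states s_1, s_2, terminal.
# V(s_1) is deliberately overestimated (0.9 instead of the true 0.5).
next_values = [0.9, 1.0, 0.0]

mc = mc_returns(rewards, gamma)                # targets built from observed returns
td = td0_targets(rewards, next_values, gamma)  # targets built from current estimates

# Gap between the two targets at s_0: the error in V(s_1), scaled by gamma.
gap_at_s0 = td[0] - mc[0]
```

In this toy setting the MC target for `s_0` depends only on the observed rewards, while the TD(0) target inherits the error in the estimate of `V(s_1)`, scaled by `gamma`. This is the kind of error propagation through bootstrap updates that the abstract refers to.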