Keywords: reinforcement learning, model-based, value gradients, attention
TL;DR: A new policy optimization algorithm that backpropagates through the model and uses attention-based temporal shortcuts to address long-term credit assignment.
Abstract: This work explores long-term credit assignment in reinforcement learning. Assigning credit over long time spans has historically been difficult in both reinforcement learning and recurrent neural networks, where discounting and gradient truncation, respectively, are often necessary for feasibility but limit the model's ability to reason over longer time scales. We propose LVGTS, a novel model-based algorithm that bridges the gap between the two fields. By backpropagating through a latent model and using temporal shortcuts to propagate gradients directly, LVGTS assigns credit from the future to the possibly distant past regardless of whether discounting or gradient truncation is used. We show, on simple but carefully designed problems, that our approach performs effective credit assignment even in the presence of distractions.
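
To make the two limitations named in the abstract concrete, the following is a minimal sketch, not the LVGTS algorithm itself. It assumes a toy scalar recurrence s_{t+1} = a * s_t; the function names `discounted_weight` and `truncated_gradient` are hypothetical and chosen only for illustration. It shows how quickly a discount factor shrinks the weight of a distant reward, and how a truncation window cuts the gradient path entirely once the delay exceeds the window.

```python
def discounted_weight(gamma: float, delay: int) -> float:
    """Weight that a reward `delay` steps in the future receives
    under discount factor `gamma` (i.e. gamma ** delay)."""
    return gamma ** delay


def truncated_gradient(a: float, delay: int, window: int) -> float:
    """Gradient d s_delay / d s_0 for the toy recurrence s_{t+1} = a * s_t,
    when backpropagation is truncated to at most `window` steps.

    Assumption for illustration: truncation simply drops every gradient
    path longer than `window` steps.
    """
    if delay > window:
        # The backward pass stops before it reaches step 0.
        return 0.0
    grad = 1.0
    for _ in range(delay):
        grad *= a  # chain rule: one factor of `a` per time step
    return grad


if __name__ == "__main__":
    # A reward 1000 steps away is almost invisible at gamma = 0.99.
    print(discounted_weight(0.99, 1000))
    # With a 20-step window, a 50-step dependency receives no gradient.
    print(truncated_gradient(1.0, 50, 20))
```

In this toy setting, neither mechanism leaves a usable learning signal for an action whose consequence arrives far in the future, which is the gap the abstract says temporal shortcuts are meant to close.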