ReNeg and Backseat Driver: Learning from demonstration with continuous human feedback

Zoe Papakipos; Jacob Beck; Michael Littman

ReNeg and Backseat Driver: Learning from demonstration with continuous human feedback

Zoe Papakipos, Jacob Beck, Michael Littman

27 Sept 2018 (modified: 05 May 2023)ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: Reinforcement learning (RL) is a powerful framework for solving problems by exploring and learning from mistakes. However, in the context of autonomous vehicle (AV) control, requiring an agent to make mistakes, or even allowing mistakes, can be quite dangerous and costly in the real world. For this reason, AV RL is generally only viable in simulation. Because these simulations have imperfect representations, particularly with respect to graphics, physics, and human interaction, we find motivation for a framework similar to RL, suitable to the real world. To this end, we formulate a learning framework that learns from restricted exploration by having a human demonstrator do the exploration. Existing work on learning from demonstration typically either assumes the collected data is performed by an optimal expert, or requires potentially dangerous exploration to find the optimal policy. We propose an alternative framework that learns continuous control from only safe behavior. One of our key insights is that the problem becomes tractable if the feedback score that rates the demonstration applies to the atomic action, as opposed to the entire sequence of actions. We use human experts to collect driving data as well as to label the driving data through a framework we call ``Backseat Driver'', giving us state-action pairs matched with scalar values representing the score for the action. We call the more general learning framework ReNeg, since it learns a regression from states to actions given negative as well as positive examples. We empirically validate several models in the ReNeg framework, testing on lane-following with limited data. We find that the best solution in this context outperforms behavioral cloning has strong connections to stochastic policy gradient approaches.

TL;DR: We introduce a novel framework for learning from demonstration that uses continuous human feedback; we evaluate this framework on continuous control for autonomous vehicles.

Keywords: learning from demonstration, imitation learning, behavioral cloning, reinforcement learning, off-policy, continuous control, autonomous vehicles, deep learning, machine learning, policy gradient

16 Replies

Loading