Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog

Natasha Jaques; Asma Ghandeharioun; Judy Hanwen Shen; Craig Ferguson; Agata Lapedriza; Noah Jones; Shixiang Gu; Rosalind Picard

Way Off-Policy Batch Deep Reinforcement Learning of Human Preferences in Dialog

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard

25 Sept 2019 (modified: 22 Jun 2025)ICLR 2020 Conference Blind SubmissionReaders: Everyone

TL;DR: We show that KL-control from a pre-trained prior can allow RL models to learn from a static batch of collected data, without the ability to explore online in the environment.

Abstract: Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. This is a critical shortcoming for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms which use KL-control to penalize divergence from a pre-trained prior model of probable actions. This KL-constraint reduces extrapolation error, enabling effective offline learning, without exploration, from a fixed batch of data. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. This Way Off-Policy (WOP) algorithm is tested on both traditional RL tasks from OpenAI Gym, and on the problem of open-domain dialog generation; a challenging reinforcement learning problem with a 20,000 dimensional action space. WOP allows for the extraction of multiple different reward functions post-hoc from collected human interaction data, and can learn effectively from all of these. We test real-world generalization by deploying dialog models live to converse with humans in an open-domain setting, and demonstrate that WOP achieves significant improvements over state-of-the-art prior methods in batch deep RL.

Code: https://drive.google.com/open?id=1XG9c4HMXwDrxTOUrVPHwR32EdnJJMuMt

Keywords: batch reinforcement learning, deep learning, dialog, off-policy, human preferences

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/way-off-policy-batch-deep-reinforcement/code)

Original Pdf: pdf

7 Replies

Loading