Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems

Published: 21 Oct 2022, Last Modified: 26 Mar 2024
Venue: LaReL 2022
Keywords: task-oriented dialogue systems, reinforcement learning, reward learning
TL;DR: Reward function learning for task-oriented dialogue systems
Abstract: When learning task-oriented dialogue (TOD) agents, one can naturally utilize reinforcement learning (RL) techniques to train conversational strategies that achieve user-specific goals. Existing work on training TOD agents mainly focuses on developing advanced RL algorithms, while the design of reward functions is not well studied. This paper discusses how to better learn and utilize reward functions for training TOD agents. Specifically, we propose two generalized objectives for reward-function learning, inspired by classical learning-to-rank losses. Further, to address the high variance of policy-gradient estimation with REINFORCE, we leverage the Gumbel-softmax trick to better estimate the gradient for TOD policies, which significantly improves training stability. With these techniques, we outperform state-of-the-art results on the end-to-end dialogue task on the MultiWOZ 2.0 dataset.
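The page does not include the paper's formal objectives, but the two techniques named in the abstract can be illustrated with short sketches. First, a minimal PyTorch sketch of the kind of pairwise learning-to-rank loss the abstract's reward-learning objectives generalize; the names here (`pairwise_ranking_loss`, `r_pos`, `r_neg`) are illustrative, not the paper's notation:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_pos - r_neg): pushes the reward model to score a
    # preferred dialogue (e.g., one that achieves the user goal) above
    # a dispreferred one
    return -F.logsigmoid(r_pos - r_neg).mean()

# toy usage: scalar reward-model scores for four preference pairs
r_pos = torch.randn(4, requires_grad=True)
r_neg = torch.randn(4, requires_grad=True)
pairwise_ranking_loss(r_pos, r_neg).backward()
```

Second, a sketch of the Gumbel-softmax trick the abstract uses in place of REINFORCE's score-function estimator: sampling through `torch.nn.functional.gumbel_softmax` keeps the one-hot sample differentiable via the straight-through relaxation, so reward gradients reach the policy logits directly. The linear "reward" below is a stand-in for illustration, not the paper's learned reward:

```python
import torch
import torch.nn.functional as F

# policy logits over a toy vocabulary of 5 tokens, batch of 2
logits = torch.randn(2, 5, requires_grad=True)

# hard=True yields a one-hot sample in the forward pass while gradients
# flow through the soft relaxation (straight-through estimator)
sample = F.gumbel_softmax(logits, tau=1.0, hard=True)

# any differentiable function of the sample can now be backpropagated
# to the logits without a REINFORCE-style score-function estimator
reward = (sample * torch.randn(2, 5)).sum()
reward.backward()
print(logits.grad.shape)  # torch.Size([2, 5])
```

The lower-variance gradients from the relaxed sample are what the abstract credits for the improved training stability of policy learning.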
Community Implementations: 1 code implementation (https://www.catalyzex.com/paper/arxiv:2302.10342/code)