Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems

Published: 21 Oct 2022, Last Modified: 26 Mar 2024
Venue: LaReL 2022
Keywords: task-oriented dialogue systems, reinforcement learning, reward learning
TL;DR: Reward function learning for task-oriented dialogue systems
Abstract: When learning task-oriented dialogue (TOD) agents, one can naturally utilize reinforcement learning (RL) techniques to train conversational strategies that achieve user-specific goals. Existing work on training TOD agents mainly focuses on developing advanced RL algorithms, while the design of reward functions is not well studied. This paper discusses how to better learn and utilize reward functions for training TOD agents. Specifically, we propose two generalized objectives for reward-function learning, inspired by classical learning-to-rank losses. Further, to address the high variance of policy-gradient estimation with REINFORCE, we leverage the Gumbel-softmax trick to better estimate the gradient for TOD policies, which significantly improves training stability. With these techniques, we outperform state-of-the-art results on the end-to-end dialogue task on the MultiWOZ 2.0 dataset.
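The page does not include the paper's formal objectives, but the two techniques named in the abstract can be illustrated with short sketches. First, a minimal PyTorch sketch of the kind of pairwise learning-to-rank loss the abstract's reward-learning objectives generalize; the names here (`pairwise_ranking_loss`, `r_pos`, `r_neg`) are illustrative, not the paper's notation:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_pos - r_neg): pushes the reward model to score a
    # preferred dialogue (e.g., one that achieves the user goal) above
    # a dispreferred one
    return -F.logsigmoid(r_pos - r_neg).mean()

# toy usage: scalar reward-model scores for four preference pairs
r_pos = torch.randn(4, requires_grad=True)
r_neg = torch.randn(4, requires_grad=True)
pairwise_ranking_loss(r_pos, r_neg).backward()
```

Second, a sketch of the Gumbel-softmax trick the abstract uses in place of REINFORCE's score-function estimator: sampling through `torch.nn.functional.gumbel_softmax` keeps the one-hot sample differentiable via the straight-through relaxation, so reward gradients reach the policy logits directly. The linear "reward" below is a stand-in for illustration, not the paper's learned reward:

```python
import torch
import torch.nn.functional as F

# policy logits over a toy vocabulary of 5 tokens, batch of 2
logits = torch.randn(2, 5, requires_grad=True)

# hard=True yields a one-hot sample in the forward pass while gradients
# flow through the soft relaxation (straight-through estimator)
sample = F.gumbel_softmax(logits, tau=1.0, hard=True)

# any differentiable function of the sample can now be backpropagated
# to the logits without a REINFORCE-style score-function estimator
reward = (sample * torch.randn(2, 5)).sum()
reward.backward()
print(logits.grad.shape)  # torch.Size([2, 5])
```

The lower-variance gradients from the relaxed sample are what the abstract credits for the improved training stability of policy learning.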
Community Implementations: 1 code implementation (https://www.catalyzex.com/paper/arxiv:2302.10342/code)