Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems

Yihao Feng; Shentao Yang; Shujian Zhang; Jianguo Zhang; Caiming Xiong; Mingyuan Zhou; Huan Wang

Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems

Yihao Feng, Shentao Yang, Shujian Zhang, Jianguo Zhang, Caiming Xiong, Mingyuan Zhou, Huan Wang

Published: 01 Feb 2023, Last Modified: 22 Jun 2025ICLR 2023 posterReaders: Everyone

Keywords: task-oriented dialogue, reinforcement learning, reward learning

Abstract: When learning task-oriented dialogue (ToD) agents, reinforcement learning (RL) techniques can naturally be utilized to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the ToD agents, while the design of the reward function is not well studied. This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end (E2E) ToD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward function to guide the training of the E2E ToD agent. With the proposed techniques, we achieve competitive results on the E2E response-generation task on the Multiwoz 2.0 dataset. Source code and checkpoints are publicly released at https://github.com/Shentao-YANG/Fantastic_Reward_ICLR2023.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

TL;DR: we propose techniques for learning and utilizing reward functions that can be used for training task-oriented dialogue agents

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 4 code implementations](https://www.catalyzex.com/paper/fantastic-rewards-and-how-to-tame-them-a-case/code)

20 Replies

Loading