Does Reinforcement Learning from Human Feedback Framework Still Work for Task-Oriented Dialogue Systems?

ACL ARR 2024 June Submission 2207 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The paradigm of applying reinforcement learning from human feedback (RLHF) after supervised fine-tuning (SFT) of language models has become widespread. In this work, we investigate whether RLHF with turn-level preferences remains effective for the task-oriented dialogue (TOD) task, which requires dialogue-level rewards. Since no human preference dataset exists for the TOD task, we develop two synthetic feedback generation methods, one for fully annotated and one for partially annotated TOD datasets. We compare these two methods against the corresponding SFT methods in an online environment where user goals are unknown. Despite the simplicity of the proposed methods, RLHF outperformed SFT on the partially annotated TOD dataset in both corpus-based and simulator-based evaluations. Our comprehensive experiments point to a direction for effectively improving system performance using data generated while providing services in real-world environments.
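To make the notion of turn-level preferences concrete, below is a minimal, hypothetical sketch of how preference pairs might be assembled from an annotated TOD corpus: the gold system turn is treated as the preferred response and an alternative sampled response as the dispreferred one. The field names, the sampling function, and the tie-breaking rule are assumptions for illustration only, not the authors' actual feedback generation methods.

```python
# Hypothetical sketch: turn-level preference pairs from an annotated TOD corpus.
# Assumes each turn record has "history" and "gold_response" fields; these names
# and the pairing rule are illustrative, not taken from the paper.
from dataclasses import dataclass
import random


@dataclass
class PreferencePair:
    context: str    # dialogue history up to the current user turn
    chosen: str     # preferred system response (gold annotation)
    rejected: str   # dispreferred system response (sampled alternative)


def make_turn_level_pairs(dialogues, sample_response):
    """Create one preference pair per system turn.

    `sample_response` is any function that proposes an alternative
    response for the same dialogue history.
    """
    pairs = []
    for turn in dialogues:
        alt = sample_response(turn["history"])
        if alt.strip() == turn["gold_response"].strip():
            continue  # identical responses give no preference signal
        pairs.append(PreferencePair(
            context=turn["history"],
            chosen=turn["gold_response"],
            rejected=alt,
        ))
    return pairs


if __name__ == "__main__":
    toy = [{"history": "User: I need a cheap hotel in the centre.",
            "gold_response": "There are 3 cheap hotels in the centre. "
                             "Any preference on stars?"}]
    noisy = lambda history: random.choice(["Sorry, I cannot help.", "Okay."])
    print(make_turn_level_pairs(toy, noisy))
```

Pairs of this form could then be fed to a standard preference-optimization objective, which is what makes the contrast with dialogue-level rewards in the abstract meaningful.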
Paper Type: Short
Research Area: Dialogue and Interactive Systems
Research Area Keywords: task-oriented
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2207