LLM-DDP: Improving Policy Learning for Composite Task-Oriented Dialogues via Large Language Model Feedback

ACL ARR 2025 February Submission 5121 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Dialogue Policy (DP) is pivotal in Task-Oriented Dialogue (TOD), and Reinforcement Learning (RL) has proven effective for training DPs. However, RL-based DPs struggle with composite tasks that involve domain dependencies, i.e., interrelated subtasks spanning multiple domains. Proximal Policy Optimization (PPO), an efficient, stable, and easy-to-use RL algorithm, has become a preferred tool for complex RL problems, while Large Language Models (LLMs) exhibit strong commonsense understanding across domains. We therefore propose integrating LLMs with an enhanced PPO method to tackle composite tasks, which we term the LLM Feedback Domain-Dependent Policy (LLM-DDP). By injecting the domain-prioritization logic of LLMs into PPO's actor-critic framework, the method improves the ability of TOD systems to handle domain-dependent tasks. Furthermore, we introduce a domain-driven critic loss that strengthens the policy network's incorporation of this domain-prioritization logic. On the MultiWOZ 2.1 dataset, under identical parameter configurations and dialogue-turn budgets, our method achieves superior performance, validating the efficacy of the proposed approach.
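The abstract does not specify the form of the domain-driven critic loss, so the following is only a minimal illustrative sketch: a standard PPO-style value regression augmented with a hypothetical priority-weighted term. The names domain_driven_critic_loss, domain_priority, and lambda_dom are assumptions for illustration, not the authors' implementation; domain_priority stands in for per-state weights that could be derived from LLM feedback about domain dependencies.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Simple state-value network used by the PPO critic."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

def domain_driven_critic_loss(critic, states, returns, domain_priority, lambda_dom=0.5):
    """Usual PPO value loss plus a term that up-weights errors on states from
    higher-priority (prerequisite) domains. `domain_priority` is a per-state
    weight in [0, 1]; its source (e.g., LLM feedback) is an assumption here."""
    values = critic(states)
    td_error = returns - values
    base_loss = td_error.pow(2).mean()                          # standard value regression
    weighted_loss = (domain_priority * td_error.pow(2)).mean()  # domain-weighted term
    return base_loss + lambda_dom * weighted_loss

# Toy usage with random data, just to show the update step.
if __name__ == "__main__":
    critic = Critic(state_dim=16)
    opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    states = torch.randn(32, 16)
    returns = torch.randn(32)
    priority = torch.rand(32)  # e.g., LLM-assigned domain priorities (illustrative)
    loss = domain_driven_critic_loss(critic, states, returns, priority)
    opt.zero_grad(); loss.backward(); opt.step()

In this sketch the coefficient lambda_dom simply trades off the plain value loss against the priority-weighted one; the paper's actual formulation may differ.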
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: task-oriented, commonsense reasoning, dialogue state tracking
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5121