Keywords: Preference Tree Optimization (PTO), Goal-Oriented Dialogue, Look-Ahead Simulation, Direct Preference Optimization (DPO), Motivational Interviewing (MI), Reinforcement Learning (RL), Language Models (LLMs), Conversational AI, Dialogue Systems, Self-Improving AI, Preference-Based Learning, Synthetic Data Generation, Oracle Evaluation, Multi-Turn Dialogue, Human-AI Interaction, Decision-Making in AI, Virtual Patients, Counseling AI, Interactive AI Training
TL;DR: Preference Tree Optimization (PTO) enhances goal-oriented dialogue systems by using look-ahead simulations and preference-based learning to improve decision-making in conversational AI.
Abstract: Developing dialogue systems capable of engaging in multi-turn, goal-oriented conversations remains a significant challenge, especially in specialized domains with limited data. This research proposes a novel framework, Preference Tree Optimization (PTO), designed to iteratively improve agent models in such dialogue systems by generating preference data with a method called Preference Tree with Look-Ahead. Focusing on Motivational Interviewing (MI), a counseling technique aimed at facilitating behavioral change, we leverage virtual patients and an oracle evaluator to simulate conversations and generate rich preference datasets. By combining this method with Direct Preference Optimization (DPO), we aim to enhance the agent's decision-making capabilities over iterative training cycles. The proposed framework addresses data scarcity and advances the development of more nuanced and effective dialogue systems in goal-oriented domains.
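The abstract names the pipeline without detailing it, so the following is a minimal sketch of how Preference Tree with Look-Ahead data generation could plausibly work: at each counselor turn, branch into several candidate responses, roll each branch forward a few simulated exchanges, score the rollouts with the oracle, and keep the best and worst candidates as a DPO preference pair. The helpers `agent_respond`, `patient_respond`, and `oracle_score` are hypothetical stand-ins for the agent model, virtual patient, and oracle evaluator; none of these names, nor the specific branching and depth values, come from the paper.

```python
import random
from dataclasses import dataclass

# --- Hypothetical stand-ins for the paper's components (not from the source) ---

def agent_respond(dialogue):
    """Placeholder agent policy: samples a candidate counselor utterance."""
    return f"counselor_utterance_{random.randint(0, 9999)}"

def patient_respond(dialogue):
    """Placeholder virtual patient: replies to the dialogue so far."""
    return f"patient_utterance_{random.randint(0, 9999)}"

def oracle_score(dialogue):
    """Placeholder oracle evaluator: scores a dialogue (e.g., MI quality)."""
    return random.random()

@dataclass
class PreferencePair:
    prompt: list      # dialogue history up to the branching point
    chosen: str       # candidate with the best look-ahead score
    rejected: str     # candidate with the worst look-ahead score

def lookahead_score(dialogue, depth):
    """Roll the conversation forward `depth` exchange pairs, then score it."""
    rollout = list(dialogue)
    for _ in range(depth):
        rollout.append(("patient", patient_respond(rollout)))
        rollout.append(("counselor", agent_respond(rollout)))
    return oracle_score(rollout)

def build_preference_pairs(num_turns=5, branching=3, depth=2):
    """Simulate one session, branching at each counselor turn to mine pairs."""
    dialogue = [("patient", patient_respond([]))]
    pairs = []
    for _ in range(num_turns):
        # Branch: sample several candidate counselor responses at this state.
        candidates = [agent_respond(dialogue) for _ in range(branching)]
        # Score each branch by simulating `depth` future exchanges.
        scored = sorted(
            ((lookahead_score(dialogue + [("counselor", c)], depth), c)
             for c in candidates),
            reverse=True,
        )
        pairs.append(PreferencePair(
            prompt=list(dialogue),
            chosen=scored[0][1],
            rejected=scored[-1][1],
        ))
        # Continue the main trajectory along the best-scoring branch.
        dialogue.append(("counselor", scored[0][1]))
        dialogue.append(("patient", patient_respond(dialogue)))
    return pairs

if __name__ == "__main__":
    for pair in build_preference_pairs():
        print(pair.chosen, ">", pair.rejected)
```

Under this reading, the mined chosen/rejected pairs would feed a standard DPO trainer, and the retrained agent would replace `agent_respond` in the next iteration of the cycle.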
Experimental evaluations demonstrate that the PTO framework enhances dialogue agents' performance in goal-oriented conversations within the domain of Motivational Interviewing. Models trained with PTO consistently outperformed the baseline on key metrics such as session satisfaction and working alliance. Additionally, incorporating look-ahead simulations led to improved long-term planning and more effective conversational strategies, with deeper look-ahead configurations yielding the most stable and highest-scoring results.
Submission Number: 31