Simulating User Agents for Embodied Conversational AI

Daniel Philipov, Vardhan Dongre, Gokhan Tur, Dilek Hakkani Tur

Published: 22 Oct 2024, Last Modified: 30 Oct 2024

Venue: NeurIPS 2024 Workshop Open-World Agents (Poster)

License: CC BY 4.0

Keywords: User Simulator, LLM Agents, Conversational AI, Embodied AI

TL;DR: User-simulator agent for predicting dialogue acts from situated embodied conversational history

Abstract: Embodied agents designed to assist users with tasks must be able to engage in natural language interactions, interpret user instructions, execute actions to complete tasks, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a specific user goal (e.g., make breakfast), at each time step during an interaction with an embodied agent (or a robot), the user agent may "observe" the robot's actions or "speak", either to proactively intervene in the robot's behavior or to reactively answer the robot's questions. Such a user agent improves the scalability and efficiency of embodied dialogue dataset generation and is critical for enhancing and evaluating the robot's interaction and task completion ability, as well as for future research, such as reinforcement learning using AI feedback. We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the benchmark TEACh dataset. We perform three experiments: zero-shot prompting to predict the dialogue act from history, few-shot prompting, and fine-tuning on the TEACh training subset. Our results demonstrate that the LLM-based user agent can achieve an F-measure of 42% in mimicking human speaking behavior with simple zero-shot prompting and 43.4% with few-shot prompting. With fine-tuning, the agent performs similarly at deciding when to speak but substantially better at deciding what to say, improving from 51.1% to 62.5%. These findings showcase the feasibility and promise of the proposed approach for assessing and enhancing the effectiveness and reliability of robot task completion through natural language communication.
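
For readers unfamiliar with the metric, the F-measure reported for the "when to speak" decision can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `f_measure`, the label strings "speak" and "observe", and the toy label sequences are all assumptions made for illustration, and the abstract does not state which averaging scheme was used, so the sketch scores only a single positive class.

```python
from typing import Sequence


def f_measure(gold: Sequence[str], pred: Sequence[str], positive: str = "speak") -> float:
    """F1 score for one positive class over per-time-step decisions.

    `gold` holds the human user's decision at each time step and `pred`
    the simulated user's decision (hypothetical labels, for illustration).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    if tp == 0:
        # No correct positive predictions: precision or recall is zero
        # (or undefined), so report 0.0 by convention.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: five time steps, invented for illustration only.
gold = ["observe", "speak", "observe", "speak", "observe"]
pred = ["observe", "speak", "speak", "observe", "observe"]
score = f_measure(gold, pred)
```

In this toy case the simulated user speaks once when the human did, once when the human did not, and misses one human turn, so precision and recall are both one half.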

Submission Number: 85
