Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Language Grounding to Vision, Robotics and Beyond
Submission Track 2: Dialogue and Interactive Systems
Keywords: Embodied AI, Embodied Task Completion, Language and Robotics, Plan Prediction, Dialog Simulation
TL;DR: We propose a method to simulate embodied dialogues, including natural language utterances and environment actions, in a simulated environment, and demonstrate its value in improving plan prediction.
Abstract: Embodied task completion is a challenge where an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem where the agent predicts actions at a higher level of abstraction called a plan, which helps make agent actions more interpretable and can be obtained by appropriately prompting large language models. We show that multimodal transformer models can outperform language-only models for this problem but fall significantly short of oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models can attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
Submission Number: 3930