Chain-of-Imagination for Reliable Instruction Following in Decision Making

Enshen Zhou; Yiran Qin; Zhenfei Yin; Yuzhou Huang; Ruimao Zhang; Lu Sheng; Yu Qiao; Jing Shao

Chain-of-Imagination for Reliable Instruction Following in Decision Making

Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

Published: 22 Oct 2024, Last Modified: 26 Oct 2024NeurIPS 2024 Workshop Open-World Agents PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Chain-of-Imagination, multimodal large language model, instruction following, low-level control, image generation

Abstract: Enabling the embodied agent to imagine step-by-step the future states and sequentially approach these situation-aware states can enhance its capability to make reliable action decisions from textual instructions. In this work, we introduce a simple but effective mechanism called Chain-of-Imagination (CoI), which repeatedly employs a Multimodal Large Language Model (MLLM) equipped diffusion model to facilitate imagining and acting upon the series of intermediate situation-aware visual sub-goals one by one, resulting in more reliable instruction-following capability. Based on the CoI mechanism, we propose an embodied agent DecisionDreamer as the low-level controller that can be adapted to different open-world scenarios. Extensive experiments demonstrate that DecisionDreamer can achieve more reliable and accurate decision-making and significantly outperform the state-of-the-art generalist agents in the Minecraft and CALVIN sandbox simulators, regarding the instruction-following capability. For more demos, please see https://sites.google.com/view/decisiondreamer.

Submission Number: 37

Loading