MultiReAct: Multimodal Tools Augmented Reasoning-Acting Traces for Embodied Agent Planning

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Model, Embodied Agent Planning, Multimodal Reasoning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In embodied AI, Large Language Models (LLMs) have demonstrated remarkable proficiency in tasks involving straightforward reasoning. However, they face substantial challenges on longer-horizon tasks described by abstract instructions, especially those involving intricate visual concepts. These challenges stem from two main limitations: LLMs, which operate primarily on text, struggle with the nuanced multimodal reasoning that complex embodied tasks demand, and they have difficulty recognizing and autonomously recovering from intermediate execution failures. To address these limitations and improve the planning capabilities of LLMs in embodied scenarios, we propose a novel approach named MultiReAct. Our framework makes three contributions: 1. We employ a parameter-efficient adaptation of a pre-trained visual language model (VLM), enabling it to tackle embodied planning tasks by translating visual demonstrations into sequences of actionable language commands. 2. Leveraging CLIP as a reward model, we detect sub-instruction execution failures, significantly boosting the success rate of achieving final objectives. 3. We introduce an adaptable paradigm for embodied planning through in-context learning from demonstrations, agnostic to the specific VLM and low-level actor; our model accommodates two distinct low-level actors, an imitation learning agent and a code-generation-based actor. We apply the MultiReAct framework to a diverse set of long-horizon planning tasks and show superior performance over previous LLM-based methods. Extensive experimental results underscore the effectiveness of our approach for long-horizon embodied planning.
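
To make the CLIP-as-reward idea in the abstract concrete, below is a minimal sketch of how an image-text similarity score could flag a failed sub-instruction. It is an illustrative assumption, not the paper's exact recipe: the checkpoint ("openai/clip-vit-base-patch32"), the cosine-similarity threshold, and the helper name `sub_instruction_succeeded` are all placeholders chosen for the example.

```python
# Hypothetical sketch: CLIP image-text similarity as a binary reward for
# detecting sub-instruction execution failure. Model choice, threshold, and
# helper name are illustrative assumptions, not the authors' exact method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sub_instruction_succeeded(image: Image.Image, sub_instruction: str,
                              threshold: float = 0.25) -> bool:
    """Return True if the observation after execution matches the sub-instruction."""
    inputs = processor(text=[sub_instruction], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    sim = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
    # A below-threshold similarity is treated as a likely execution failure,
    # which would trigger replanning or retrying the sub-instruction.
    return sim >= threshold
```

In practice the threshold would need to be calibrated per environment, since raw CLIP cosine similarities for matched image-text pairs typically fall in a narrow band.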
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2595