Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Model, Embodied Agent Planning, Multimodal Reasoning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In the field of embodied AI, Large Language Models (LLMs) have demonstrated remarkable proficiency in tasks involving straightforward reasoning.
However, they encounter substantial challenges with long-horizon tasks described by abstract instructions, especially those involving intricate visual concepts.
These challenges arise from two main limitations:
(1) being primarily text-based, LLMs struggle with complex embodied tasks that demand nuanced multimodal reasoning;
(2) LLMs have difficulty recognizing and autonomously recovering from intermediate execution failures.
To address these limitations and improve the planning capabilities of LLMs in embodied scenarios, we propose a novel approach named MultiReAct.
Our framework makes the following contributions:
1. We employ a parameter-efficient adaptation of a pre-trained visual language model, enabling it to tackle embodied planning tasks by translating visual demonstrations into sequences of actionable language commands.
2. Leveraging CLIP as a reward model, we detect sub-instruction execution failures, significantly boosting the success rate of achieving the final objective (a minimal illustrative sketch follows the abstract).
3. We introduce an adaptable paradigm for embodied planning through in-context learning from demonstrations, agnostic to the specific Visual Language Model (VLM) and low-level actor.
Our model accommodates two distinct low-level actors: an imitation learning agent and a code generation-based actor.
We apply the MultiReAct framework to a diverse set of long-horizon planning tasks, where it outperforms previous LLM-based methods.
Extensive experimental results underscore the effectiveness of our approach on long-horizon embodied planning.
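To illustrate contribution 2 above, the following is a minimal sketch of how CLIP image-text similarity could serve as a reward signal for flagging sub-instruction execution failures. The model checkpoint, the threshold value, and the `clip_reward` / `detect_failure` helpers are illustrative assumptions, not the paper's exact implementation.
```python
# Minimal sketch (assumption, not the authors' exact implementation):
# use CLIP image-text similarity as a reward signal and flag a sub-instruction
# as failed when the similarity falls below a threshold, so the planner can
# retry or re-plan.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(observation: Image.Image, sub_instruction: str) -> float:
    """Cosine similarity between the current observation and the sub-instruction text."""
    inputs = processor(
        text=[sub_instruction], images=observation, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

def detect_failure(observation: Image.Image, sub_instruction: str,
                   threshold: float = 0.25) -> bool:
    """Flag the sub-instruction as failed when the reward is below the threshold
    (the threshold value here is an illustrative assumption)."""
    return clip_reward(observation, sub_instruction) < threshold
```
In practice such a check would run after each sub-instruction, letting the planner decide whether to proceed to the next step or to recover from the detected failure.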
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2595