Instruct2Act: Mapping Multi-modality Instructions to Robotic Arm Actions with Large Language Model

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: large language models, robotic manipulation, code generation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Foundation models have advanced significantly across applications such as text-to-image generation, open-vocabulary segmentation, and natural language processing. This paper presents Instruct2Act, a framework that leverages Large Language Models (LLMs) to convert multi-modal instructions into sequential actions for robotic manipulation tasks. Specifically, Instruct2Act uses LLMs to generate Python programs that form a comprehensive perception, planning, and action loop for robotic tasks. It uses pre-defined APIs to access multiple foundation models, with the Segment Anything Model (SAM) identifying candidate objects and CLIP classifying them semantically. This approach combines the strengths of foundation models and robotic actions to transform complex, high-level instructions into precise policy code. Our approach is adaptable and versatile, handling various instruction modalities and input types while meeting specific task requirements. We validated the practicality and efficiency of our approach on robotic tasks, including tabletop and 6-Degree-of-Freedom (DoF) manipulation scenarios, in both simulated and real-world environments. Furthermore, our zero-shot method surpasses many state-of-the-art learning-based policies on several tasks. The code for our proposed approach is available at https://anonymous.4open.science/r/Instruct2Act, providing a solid benchmark for high-level robotic instruction tasks with diverse modality inputs.
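
For intuition, below is a minimal sketch of the kind of LLM-generated policy code the abstract describes: a program that calls pre-defined APIs wrapping SAM (to propose object regions) and CLIP (to label them), then issues an arm primitive. All names here (sam_segment, clip_classify, pick_and_place) are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Hypothetical sketch of an Instruct2Act-style perception-planning-action loop.
# The three helper functions are illustrative stubs standing in for the
# framework's pre-defined foundation-model and motion APIs.

from typing import Dict, List, Tuple

def sam_segment(image) -> List[Dict]:
    """Stand-in for a SAM call: return candidate object crops and centers."""
    return [{"crop": image, "center": (320, 240)}]

def clip_classify(crop, labels: List[str]) -> str:
    """Stand-in for a CLIP call: return the label best matching the crop."""
    return labels[0]

def pick_and_place(pick_xy: Tuple[int, int], place_xy: Tuple[int, int]) -> None:
    """Stand-in for a low-level motion primitive on the robot arm."""
    print(f"pick at {pick_xy}, place at {place_xy}")

def policy(image, target_object: str, distractors: List[str],
           place_xy: Tuple[int, int]) -> None:
    """The kind of program an LLM might emit for
    'put the <target_object> at <place_xy>'."""
    for obj in sam_segment(image):                      # perception (SAM)
        label = clip_classify(obj["crop"], [target_object] + distractors)
        if label == target_object:                      # planning (CLIP match)
            pick_and_place(obj["center"], place_xy)     # action
            break

policy(image=None, target_object="red block", distractors=["blue bowl"],
       place_xy=(480, 360))
```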
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9438