Self-Corrected Multimodal Large Language Model for Robot Manipulation and Reflection

Jiaming Liu; Chenxuan Li; Guanqun Wang; Xiaoqi Li; Sixiang Chen; Chuyan Xiong; Jiaxin Ge; Kaichen Zhou; Shanghang Zhang

16 Sept 2024 (modified: 14 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Robot Manipulation, Pose Correction, Multimodal Large Language Model

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated potential in visual instruction following across various tasks. Recently, some studies have integrated MLLMs into robotic manipulation, allowing robots to interpret multimodal information and predict low-level actions. While MLLM-based policies have shown promising progress, they may predict end-effector poses that fail during execution when faced with novel tasks or categories. To emulate human-like reasoning modes for more robust manipulation, we propose a Self-Corrected (SC)-MLLM. Our model combines fast system reasoning, which directly predicts end-effector poses, with slow system reasoning, which reflects on and corrects failed actions. For the fast system, we introduce parameter-efficient fine-tuning to give the MLLM pose prediction capabilities, reframing pose prediction as a language modeling problem. For the slow system, when an execution fails, our model learns to detect the cause of the low-level action error (i.e., a position or rotation error) and adaptively seeks prompt feedback from experts. Based on this feedback, SC-MLLM reflects on the current failure case and attempts to generate a corrected action. Furthermore, we design a continuous policy learning method that uses successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate our method, we conduct extensive experiments in both simulation and real-world settings. SC-MLLM significantly improves manipulation accuracy over the previous state-of-the-art MLLM-based policy (ManipLLM), from 57% to 79% on seen object categories and from 47% to 69% on unseen categories. Our project web page: https://sites.google.com/view/sc-mllm
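
The predict, detect, seek feedback, correct cycle described in the abstract can be sketched as a small control loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `Pose`, `Episode`, `run_episode`, `corrected_samples`, the five callables, and the `max_corrections` limit are all hypothetical names introduced here, and the MLLM, the expert, and the robot are stood in for by plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Pose:
    """Hypothetical end-effector pose: a position and a rotation vector."""
    position: Vec3
    rotation: Vec3


@dataclass
class Episode:
    """Outcome of one manipulation attempt, including any corrections."""
    pose: Pose
    success: bool
    corrections: int


def run_episode(
    predict: Callable[[], Pose],                # fast system: direct pose prediction
    execute: Callable[[Pose], bool],            # returns True if the action succeeds
    diagnose: Callable[[Pose], str],            # slow system: "position" or "rotation"
    ask_expert: Callable[[str], Vec3],          # feedback for the diagnosed error type
    correct: Callable[[Pose, str, Vec3], Pose], # slow system: produce a corrected pose
    max_corrections: int = 2,                   # assumed budget, not from the source
) -> Episode:
    pose = predict()
    corrections = 0
    while True:
        if execute(pose):
            return Episode(pose, True, corrections)
        if corrections == max_corrections:
            return Episode(pose, False, corrections)
        # Failure: identify the error type, request feedback, and retry.
        error_kind = diagnose(pose)
        feedback = ask_expert(error_kind)
        pose = correct(pose, error_kind, feedback)
        corrections += 1


def corrected_samples(episodes: List[Episode]) -> List[Pose]:
    """Keep only poses that succeeded after at least one correction.

    These are the kind of samples the abstract says are reused for
    continuous policy learning; the training step itself is not shown.
    """
    return [ep.pose for ep in episodes if ep.success and ep.corrections > 0]
```

In this sketch, an episode that succeeds on the first prediction never consults the expert, which mirrors the stated goal of reducing the frequency of expert intervention as the policy adapts.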

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Submission Number: 971
