MALLVi: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

ICLR 2026 Conference Submission 19407 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Object Localization, Robot Manipulation, Multi-Agent Systems, LLM
TL;DR: An integrated framework named MALLVi based on LLM agents to perform generalized robotics manipulation tasks.
Abstract: Task planning for robotic manipulation using large language models (LLMs) is a relatively new area. Previous approaches have relied on training specialized models, fine-tuning pipeline components, or adapting LLMs to the setup through prompt tuning. However, many of these approaches lack environmental feedback. We introduce the MALLVi Framework, a Multi-Agent Large Language and Vision framework designed to solve robotic manipulation tasks by leveraging closed-loop feedback from the environment. The agents are given an instruction in natural language, and the vision-language model (VLM) additionally receives an image of the current environment state. After thorough investigation and reasoning, MALLVi generates a series of realizable atomic instructions necessary for a robot manipulator to complete the task. The VLM receives environmental feedback and prompts the framework either to repeat this procedure until success or to proceed with the next atomic instruction. Our work shows that, with careful prompt engineering, the integration of five LLM agents (Decomposer, Perceptor, Thinker, Actor, and Reflector) can autonomously manage all components of a manipulation task: initial perception, object localization, reasoning, and high-level planning. Moreover, adding a Descriptor agent introduces a visual memory of the initial environment state into the pipeline. Crucially, compared to previous works, the reflecting agent can evaluate the completion or failure of each sub-task. We validate our framework through experiments conducted both in simulated environments using VIMABench and RLBench and in real-world settings. Our framework handles diverse tasks, from standard manipulation benchmarks to custom user instructions. Our results show that agents communicating to plan, execute, and evaluate tasks iteratively not only yield generalized performance but also increase the average success rate across trials. Our experiments highlight the essential role of the Reflector in the pipeline.
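
The abstract outlines a closed-loop, multi-agent control flow; the sketch below illustrates how such an orchestration might look in Python. It is a minimal sketch of the described loop, not the authors' implementation: every class, method, and agent prompt here (Agent, MockEnv, query, capture_image, execute) is a hypothetical placeholder, since this page does not specify MALLVi's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """A thin wrapper around a role-specific LLM/VLM call (hypothetical)."""
    role: str
    system_prompt: str

    def query(self, user_prompt: str, image=None) -> str:
        # Placeholder: replace with a real LLM/VLM API call. The stub
        # echoes a fixed verdict so the control flow below terminates.
        return f"success [{self.role}: {user_prompt[:40]}]"


class MockEnv:
    """Stand-in for a simulator (e.g. VIMABench/RLBench) or a real robot."""

    def capture_image(self):
        return None  # stand-in for a camera frame

    def execute(self, action: str):
        print(f"executing: {action}")


def run_task(instruction: str, env) -> None:
    """Plan, execute, and reflect on one manipulation task."""
    descriptor = Agent("Descriptor", "Describe the initial scene in detail.")
    decomposer = Agent("Decomposer", "Split the task into atomic sub-tasks.")
    perceptor = Agent("Perceptor", "Localize objects relevant to a sub-task.")
    thinker = Agent("Thinker", "Reason about how to achieve the sub-task.")
    actor = Agent("Actor", "Emit one executable atomic instruction.")
    reflector = Agent("Reflector", "Judge from an image whether the sub-task succeeded.")

    # The Descriptor builds a visual memory of the initial environment state.
    scene_memory = descriptor.query(instruction, image=env.capture_image())
    sub_tasks = decomposer.query(f"{instruction}\nScene: {scene_memory}").splitlines()

    for sub_task in sub_tasks:
        done = False
        while not done:
            objects = perceptor.query(sub_task, image=env.capture_image())
            plan = thinker.query(f"{sub_task}\nObjects: {objects}")
            action = actor.query(plan)
            env.execute(action)  # the robot executes the atomic instruction
            # Closed-loop feedback: the Reflector inspects the new state and
            # either approves the sub-task or triggers a retry.
            verdict = reflector.query(sub_task, image=env.capture_image())
            done = verdict.strip().lower().startswith("success")


if __name__ == "__main__":
    run_task("stack the red block on the green block", MockEnv())
```

The inner while loop is the retry mechanism described in the abstract: the Reflector's verdict on the post-action image decides whether the pipeline repeats the sub-task or advances to the next atomic instruction.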
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 19407