Abstract: It is a long-standing goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways. However, existing approaches usually struggle with the compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end, we introduce MP5, an open-ended multimodal embodied system built upon the challenging Minecraft simulator, which can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control, with frequent communication with a goal-conditioned active perception scheme. Specifically, MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs), and the system is divided into functional modules that can be scheduled and made to collaborate to ultimately solve pre-defined context- and process-dependent tasks. Extensive experiments show that MP5 achieves a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on the context. Moreover, MP5 exhibits a remarkable ability to address many open-ended tasks that are entirely novel. Please see the project page at https://iranqin.github.io/MP5.github.io/.