ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Published: 22 Oct 2024, Last Modified: 04 Nov 2024
Venue: NeurIPS 2024 Workshop Open-World Agents (Oral)
License: CC BY 4.0
Keywords: Imitation Learning, Open World, Visual Prompting
Abstract:

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments remains challenging. A key issue is the difficulty of smoothly connecting individual entities in low-level observations with the abstract concepts required for planning. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs’ visual-language reasoning abilities, enabling them to solve complex creative tasks, especially those heavily reliant on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making. Code and demos will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
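
To make the phrase "concatenated visual observations and segmentation masks" concrete, the following is a minimal sketch of one way such a policy input could be assembled. It is not the authors' implementation: it assumes the mask is binary and is appended to each RGB pixel as a fourth channel, and the names `concat_frame_and_mask` and `build_policy_input` are hypothetical. The abstract does not specify the concatenation axis, the mask encoding, or the number of past frames.

```python
from typing import List

Frame = List[List[List[float]]]  # H x W x 3 RGB image (assumed layout)
Mask = List[List[int]]           # H x W binary segmentation mask (assumed encoding)


def concat_frame_and_mask(frame: Frame, mask: Mask) -> List[List[List[float]]]:
    """Append the mask as a fourth channel to every pixel, giving H x W x 4."""
    same_height = len(frame) == len(mask)
    same_width = all(len(fr) == len(mr) for fr, mr in zip(frame, mask))
    if not (same_height and same_width):
        raise ValueError("frame and mask must share the same height and width")
    return [
        [pixel + [float(m)] for pixel, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


def build_policy_input(frames: List[Frame], masks: List[Mask]):
    """Pair each past or present frame with its mask, one entry per timestep."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return [concat_frame_and_mask(f, m) for f, m in zip(frames, masks)]


# Toy example: two timesteps of a 2x2 image, object at the top-left pixel.
frames = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
           [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]]]] * 2
masks = [[[1, 0], [0, 0]]] * 2
policy_input = build_policy_input(frames, masks)
```

In this sketch the per-timestep masks stand in for the output of an object tracker; how SAM-2 is actually queried, and how the policy consumes the resulting sequence, is left to the paper and project page.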

Submission Number: 123