Abstract: Improving the generalization of general-purpose robotic manipulation in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data, which is costly and time-consuming. Even so, because the collected data lack diversity, these approaches typically have limited capability in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, covering new object instances, semantic categories, and unseen backgrounds. First, we introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner. Extensive real-world experiments conducted on a Franka Emika robot and a low-cost dual-arm robot demonstrate the effectiveness of our proposed paradigm and policy. Demos can be found in link 1 or link 2, and our code will be released at https://github.com/MCG-NJU/TPM.