Equivariant Open-vocabulary Pick and Place via Language Kernels and Patch-level Semantic Maps

Mingxi Jia; Haojie Huang; Zhewen Zhang; Chenghao Wang; Linfeng Zhao; Dian Wang; Jason Xinyu Liu; Robin Walters; Robert Platt; Stefanie Tellex

Published: 01 Jul 2024, Last Modified: 08 Jul 2024. Venue: GAS @ RSS 2024. License: CC BY 4.0

Keywords: Language-conditioned Robotic Manipulation, Zero-shot Learning, Learning from Demonstrations

TL;DR: Efficient language-conditioned pick and place policy learning

Abstract: Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and synthesizing complex robot behavior. Achieving this capability is challenging, however, because it requires a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and struggle with generalization. This paper introduces Grounded Equivariant Manipulation (GEM), a novel approach that leverages the generative capabilities of pre-trained vision-language models and geometric symmetries to facilitate few-shot and zero-shot learning for open-vocabulary robot manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and superior generalization across diverse pick-and-place tasks, both in simulation and in the real world, showcasing its ability to adapt to novel instructions and unseen objects with minimal data requirements. GEM marks a significant step forward in language-conditioned robot control, bridging the gap between semantic understanding and action generation in robotic systems.

Submission Number: 11
