Generalizable Robotic Manipulation: Object-Centric Diffusion Policy with Language Guidance

Published: 24 Jun 2024, Last Modified: 07 Jul 2024
EARL 2024 Poster
License: CC BY 4.0
Keywords: Imitation Learning, Object-Centric Representation, Guided Diffusion, Large Language Model
TL;DR: We propose an effective language-guided, collision-aware visuomotor policy that generalizes across diverse aspects of robotic manipulation.
Abstract: Learning from demonstrations struggles to generalize beyond the training data and is fragile even to slight visual variations. To tackle this problem, we introduce Lan-o3dp, a language-guided object-centric diffusion policy that takes a 3D representation of task-relevant objects as conditional input and can be guided by a cost function for safety constraints at inference time. Lan-o3dp generalizes strongly across various aspects, such as background changes, camera-view shifts, and visual ambiguity, and can avoid novel obstacles that were unseen during demonstration. Specifically, we first train a diffusion policy conditioned on point clouds of target objects, then harness a large language model to decompose the user instruction into task-related units consisting of target objects and obstacles. These units serve as visual observations for the policy network or are converted into a cost function that guides trajectory generation toward collision-free regions at test time. Our proposed method achieves higher training efficiency and success rates than the baselines in simulation experiments. In real-world experiments, our method generalizes strongly to unseen instances, cluttered scenes, and scenes with multiple similar objects, and demonstrates training-free obstacle avoidance.
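The abstract's test-time guidance idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the denoising network, the sphere-obstacle cost, the finite-difference gradient, and the `guide_scale` parameter are all assumptions chosen to show how a cost gradient can steer each reverse-diffusion step of a trajectory away from obstacles.

```python
import numpy as np

def obstacle_cost(traj, obstacles, radius=0.1):
    """Sum of penetration depths of trajectory waypoints into obstacle spheres.

    traj: (T, 3) waypoints; obstacles: (K, 3) sphere centers (assumed shapes).
    """
    d = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)  # (T, K)
    return np.maximum(radius - d, 0.0).sum()

def cost_gradient(traj, obstacles, radius=0.1, eps=1e-4):
    """Finite-difference gradient of the cost w.r.t. each waypoint coordinate."""
    base = obstacle_cost(traj, obstacles, radius)
    grad = np.zeros_like(traj)
    for idx in np.ndindex(traj.shape):
        perturbed = traj.copy()
        perturbed[idx] += eps
        grad[idx] = (obstacle_cost(perturbed, obstacles, radius) - base) / eps
    return grad

def guided_denoise_step(traj, denoiser, obstacles, guide_scale=1.0):
    """One reverse-diffusion step with cost guidance: denoise, then descend the cost."""
    traj = denoiser(traj)  # stand-in for the learned point-cloud-conditioned network
    return traj - guide_scale * cost_gradient(traj, obstacles)
```

With an identity denoiser as a placeholder, repeated guided steps push waypoints out of the obstacle sphere while waypoints already in free space are left untouched; in the actual method this gradient would be applied inside each denoising iteration of the trained policy.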
Submission Number: 8