Keywords: 3D manipulation, Imitation Learning, Coarse-to-fine Policy
Abstract: Hierarchical coarse-to-fine policies, in which a coarse branch predicts a region of interest to guide a fine-grained action predictor, have demonstrated significant potential in robotic 3D manipulation tasks by improving sample efficiency and enabling more precise manipulation.
However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues.
To improve generalization to novel instructions and environment variations, we propose the Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation.
Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability.
Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories.
In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.
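The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. This is not CLAP's implementation: every function name below is hypothetical, the coarse stage is a stand-in that picks the highest-scoring point (where the paper describes a fine-tuned VLM predicting 3D keypoints), and the fine stage is a stand-in that returns the centroid of the cropped region (where the paper describes a learned action predictor). It only shows the control flow of predicting a region of interest and then restricting the fine prediction to that region.

```python
from math import dist

Point = tuple[float, float, float]

def coarse_predict_center(points, score):
    """Coarse stage (stand-in): return the point with the highest score."""
    return max(points, key=score)

def crop_region(points, center, radius):
    """Keep only the points within `radius` of the coarse center."""
    return [p for p in points if dist(p, center) <= radius]

def fine_predict_position(region):
    """Fine stage (stand-in): return the centroid of the cropped region."""
    n = len(region)
    return tuple(sum(p[i] for p in region) / n for i in range(3))

def coarse_to_fine(points, score, radius):
    """Run the coarse stage, crop around its output, then run the fine stage."""
    center = coarse_predict_center(points, score)
    region = crop_region(points, center, radius)
    return fine_predict_position(region)

# Toy scene: three points near the origin and one distant outlier.
scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.2, 0.0), (5.0, 5.0, 5.0)]
# Hypothetical relevance score (an assumption): prefer points near x = 1.
target = coarse_to_fine(scene, score=lambda p: -abs(p[0] - 1.0), radius=0.5)
```

Because the fine stage only sees points inside the cropped region, the distant outlier cannot influence the final prediction, which is the intuition behind the precision benefit mentioned above.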
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 17709