Abstract: We present Points2Plans, a framework for composable planning with a
relational dynamics model that enables robots to solve long-horizon manipulation
tasks from partial-view point clouds. Given a language instruction and a point cloud
of the scene, our framework initiates a hierarchical planning procedure, whereby a
language model generates a high-level plan and a sampling-based planner produces
constraint-satisfying continuous parameters for manipulation primitives sequenced
according to the high-level plan. Key to our approach is the use of a relational dynamics
model as a unifying interface between the continuous and symbolic representations
of states and actions, thus facilitating language-driven planning from high-dimensional
perceptual input such as point clouds. Whereas previous relational dynamics models
require training on datasets of multi-step manipulation scenarios that align with the
intended test scenarios, Points2Plans uses only single-step simulated training data while
generalizing zero-shot to a variable number of steps during real-world evaluations. We
evaluate our approach on tasks involving geometric reasoning, multi-object interactions,
and occluded object reasoning in both simulated and real-world settings. Results
demonstrate that Points2Plans offers strong generalization to unseen long-horizon tasks
in the real world, where it solves over 85% of evaluated tasks while the next best baseline
solves only 50%. Qualitative demonstrations of our approach operating on a mobile
manipulator platform are made available at sites.google.com/stanford.edu/points2plans.
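
To make the planning pipeline described above concrete, below is a minimal, hypothetical Python sketch of the hierarchical loop: a language model proposes a high-level primitive skeleton, and a sampling-based planner fills in continuous parameters for each primitive, using a relational dynamics model to predict whether a candidate satisfies that step's constraints. All names here (`propose_skeleton`, `sample_params`, `dynamics`, etc.) are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of the hierarchical planning loop described in the abstract.
# Not the authors' code; every callable below is an assumed interface.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Primitive:
    name: str                        # e.g. "pick", "place", "push"
    params: Optional[list] = None    # continuous parameters chosen by the sampler


def plan(
    instruction: str,
    point_cloud,                                            # partial-view point cloud of the scene
    propose_skeleton: Callable[[str], List[str]],           # LLM: instruction -> primitive names
    encode: Callable[[object], object],                     # point cloud -> latent relational state
    dynamics: Callable[[object, Primitive], object],        # relational dynamics: (state, action) -> next state
    satisfies: Callable[[object, str], bool],               # does the predicted state meet the step's subgoal?
    sample_params: Callable[[str, object], list],           # sample continuous parameters for a primitive
    num_samples: int = 64,
) -> Optional[List[Primitive]]:
    """Propose a high-level plan, then fill in constraint-satisfying
    continuous parameters by rolling out the dynamics model."""
    skeleton = propose_skeleton(instruction)   # high-level plan, e.g. ["pick", "place"]
    state = encode(point_cloud)                # latent state inferred from the partial view
    plan_out: List[Primitive] = []
    for step_name in skeleton:
        for _ in range(num_samples):
            candidate = Primitive(step_name, sample_params(step_name, state))
            next_state = dynamics(state, candidate)   # predicted effect; nothing is executed yet
            if satisfies(next_state, step_name):
                plan_out.append(candidate)
                state = next_state
                break
        else:
            return None    # no feasible parameters found for this step
    return plan_out
```

In this reading, the relational dynamics model is the shared interface: the symbolic skeleton indexes which primitive to ground, while the sampled continuous parameters are validated entirely against the model's predicted latent states rather than the real world.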