Keywords: SE(3) Equivariance; Multi-task Transformer; sample efficient
TL;DR: We propose a novel SE(3)-equivariant multi-task transformer that provably adapts its actions to 3D scene transformations.
Abstract: Multi-task manipulation policies often build on a transformer's ability to jointly process language instructions and 3D observations in a shared embedding space. However, real-world tasks frequently require robots to generalize to novel 3D object poses. Policies based on shared embeddings break geometric consistency and struggle with 3D generalization. To address this issue, we propose EquAct, which is theoretically guaranteed to generalize to novel 3D scene transformations by leveraging SE(3) equivariance shared across language, observations, and actions. EquAct makes two key contributions: (1) an efficient SE(3)-equivariant point-cloud-based U-Net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. Finally, EquAct demonstrates strong spatial generalization ability and achieves state-of-the-art performance across $18$ RLBench tasks with both SE(3) and SE(2) scene perturbations, under varying amounts of training data, and on $4$ physical tasks.
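The SE(3)-invariant conditioning idea can be illustrated with a minimal sketch: apply FiLM modulation (language-conditioned scale and shift) to rotation-invariant scalars derived from geometric features. Here the invariants are taken as per-channel norms of 3D vector features; this is an assumption for illustration, as the paper's actual iFiLM construction is not specified on this page.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))       # fix column signs
    if np.linalg.det(q) < 0:       # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def ifilm(vector_feats, gamma, beta):
    """FiLM on rotation-invariant scalars (illustrative, not the paper's code).

    vector_feats: (N, C, 3) per-point vector features that rotate with the scene.
    gamma, beta:  (C,) hypothetical language-conditioned scale and shift.
    Per-channel norms discard orientation, so the output is unchanged
    under any rotation R applied to the vector features.
    """
    invariants = np.linalg.norm(vector_feats, axis=-1)  # (N, C), SO(3)-invariant
    return gamma * invariants + beta                    # FiLM modulation

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 3))                 # toy point features
gamma, beta = rng.normal(size=4), rng.normal(size=4)
R = random_rotation(rng)

out = ifilm(x, gamma, beta)
out_rot = ifilm(x @ R.T, gamma, beta)          # rotate every vector feature
print(np.allclose(out, out_rot))               # conditioning is rotation-invariant
```

Because the modulated quantities are invariant, the language conditioning cannot break the equivariance of the rest of the network, which is the property the abstract attributes to the iFiLM layers.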
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 14204