3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation

Published: 08 Aug 2025, Last Modified: 16 Sept 2025 · CoRL 2025 Poster · CC BY 4.0
Keywords: Vision-Language-Action; Robotic Manipulation; Imitation Learning
Abstract: Recently, 2D vision-language-action (VLA) models have made significant strides in multi-task manipulation. However, these models struggle to reason about 3D spatial relationships from 2D image inputs. Although an increasing number of 3D approaches explicitly integrate 3D information, they face challenges such as the limited availability of large-scale 3D datasets and the loss of spatial information during input processing. Meanwhile, existing policies typically follow a perception-to-action learning paradigm and lack an explicit understanding of the spatial and temporal relationships between the robot and its environment. To address this, we propose 3DS-VLA, which enhances pretrained 2D vision-language models (VLMs) with comprehensive 3D awareness, enabling the prediction of robust end-effector poses. Specifically, we enable a 2D vision encoder to encode both 2D images and 3D spatial observations by introducing a 2D-to-3D positional alignment mechanism. This allows 3DS-VLA to leverage the large-scale pretrained knowledge of the VLM for effective reasoning in complex 3D robotic environments. Furthermore, to better capture the spatiotemporal relationship between 3D observations and robot behavior, we guide the model to learn sequential 3D spatial constraints, which define affordance-relevant 3D keypoints on objects and ensure robust interactions. Experiments in simulated and real-world environments demonstrate that 3DS-VLA outperforms previous state-of-the-art policies and showcases its generalization across multi-task, multi-embodiment, and diverse environmental settings.
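The abstract describes the 2D-to-3D positional alignment mechanism only at a high level. The sketch below is one plausible reading under stated assumptions: depth maps and camera intrinsics are available per view, and alignment is realized as an additive per-patch 3D positional embedding fed into a pretrained 2D vision encoder. The class and parameter names (e.g. PositionalAlignment3D) are hypothetical and not taken from the paper.

```python
# Minimal sketch (hypothetical, not the authors' code): lift each image patch to a
# camera-frame 3D point using depth + intrinsics, embed that point, and add it to the
# 2D patch tokens so a pretrained 2D encoder can also consume 3D spatial observations.
import torch
import torch.nn as nn


class PositionalAlignment3D(nn.Module):
    def __init__(self, embed_dim: int, patch_size: int = 14):
        super().__init__()
        self.patch_size = patch_size
        # Small MLP mapping per-patch 3D coordinates into the encoder's token space.
        self.point_embed = nn.Sequential(
            nn.Linear(3, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def lift_patches(self, depth, intrinsics):
        """Back-project the center pixel of each patch to a camera-frame 3D point.

        depth:      (B, H, W) depth map aligned with the RGB image
        intrinsics: (B, 3, 3) pinhole camera matrices
        returns:    (B, N, 3) one 3D point per patch, N = (H // ps) * (W // ps)
        """
        B, H, W = depth.shape
        ps = self.patch_size
        # Pixel coordinates of patch centers.
        ys = torch.arange(ps // 2, H, ps, device=depth.device)
        xs = torch.arange(ps // 2, W, ps, device=depth.device)
        v, u = torch.meshgrid(ys, xs, indexing="ij")        # (h, w) each
        z = depth[:, v, u]                                   # (B, h, w)
        fx = intrinsics[:, 0, 0, None, None]
        fy = intrinsics[:, 1, 1, None, None]
        cx = intrinsics[:, 0, 2, None, None]
        cy = intrinsics[:, 1, 2, None, None]
        x = (u[None] - cx) * z / fx
        y = (v[None] - cy) * z / fy
        return torch.stack([x, y, z], dim=-1).flatten(1, 2)  # (B, N, 3)

    def forward(self, patch_tokens, depth, intrinsics):
        """Add aligned 3D positional embeddings to 2D patch tokens of shape (B, N, D)."""
        points = self.lift_patches(depth, intrinsics)
        return patch_tokens + self.point_embed(points)
```

Under this reading, the pretrained encoder's weights and 2D positional embeddings are left intact; the 3D signal enters purely additively, which is consistent with the claim of reusing large-scale 2D pretraining for 3D reasoning.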
Supplementary Material: zip
Spotlight: zip
Submission Number: 71