Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale image and text pretraining.
However, existing VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches incorporate explicit 3D inputs such as point clouds or depth maps, but this requires additional depth sensors or pretrained depth estimation models, which can produce noisy or unreliable estimates. In contrast, our work introduces a plug-and-play module that implicitly incorporates 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. This integration provides the model with depth-aware visual representations, improving its ability to understand the geometric structure of the scene and the spatial relationships among objects from RGB images alone. We evaluate our method on a set of spatially challenging tasks in both simulation and the real world, and extensive experiments show that it significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
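To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of one way such a plug-and-play fusion could look: geometry tokens produced by a frozen visual geometry foundation model are projected and injected into the VLM's visual tokens via cross-attention with a residual connection, so the downstream action policy sees depth-aware representations from RGB input alone. The module names, dimensions, and the cross-attention design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: fusing VLM patch tokens with geometry features
# from a visual geometry foundation model via cross-attention.
# Names, dimensions, and fusion design are assumptions for illustration.
import torch
import torch.nn as nn


class GeometryFusionBlock(nn.Module):
    """Injects depth-aware geometry tokens into VLM visual tokens."""

    def __init__(self, vlm_dim: int = 1024, geo_dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, vlm_dim)      # align geometry feature width
        self.cross_attn = nn.MultiheadAttention(vlm_dim, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(vlm_dim)
        self.norm_kv = nn.LayerNorm(vlm_dim)

    def forward(self, vlm_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (B, N_v, vlm_dim) patch tokens from the pretrained VLM encoder
        # geo_tokens: (B, N_g, geo_dim) features from the frozen geometry model
        geo = self.norm_kv(self.geo_proj(geo_tokens))
        q = self.norm_q(vlm_tokens)
        fused, _ = self.cross_attn(q, geo, geo)
        return vlm_tokens + fused                        # residual keeps it plug-and-play


# Usage: enrich RGB-only VLM tokens with implicit 3D structure before the action head.
if __name__ == "__main__":
    vlm_tokens = torch.randn(2, 256, 1024)   # e.g. 16x16 patches per image
    geo_tokens = torch.randn(2, 196, 768)    # geometry-model tokens for the same RGB frame
    block = GeometryFusionBlock()
    out = block(vlm_tokens, geo_tokens)
    print(out.shape)  # torch.Size([2, 256, 1024])
```

Because the fused tokens are added residually and keep the original shape, such a module could in principle be dropped between a VLM backbone and an action head without modifying either component.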