BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: 3D Robot Manipulation; Vision-Language Model; Vision-Language-Action Model
TL;DR: We propose a 3D VLA model that aligns inputs and outputs within a shared 2D space during both pre-training and fine-tuning, enabling high data efficiency and strong performance in both basic and generalization settings.
Abstract: Recently, leveraging pre-trained vision-language models (VLMs) to build vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low data efficiency. In this paper, we introduce a new paradigm for constructing 3D VLAs. Specifically, we first pre-train the VLM backbone to take 2D images as input and produce 2D heatmaps as output. Using this pre-trained VLM as the backbone, we then fine-tune the entire VLA model while maintaining alignment between inputs and outputs by: (1) projecting raw point cloud inputs into multi-view images, and (2) predicting heatmaps before generating the final action. Extensive experiments show that the resulting model, BridgeVLA, can learn 3D manipulation both efficiently and effectively. BridgeVLA outperforms state-of-the-art baselines across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4\% to 88.2\%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7\% to 64.0\%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32\% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 95.4\% on 10+ tasks with only 3 trajectories per task, while other VLA methods such as $\pi_{0}$ fail completely. Project Website: https://bridgevla.github.io/.
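The abstract's core idea, projecting a point cloud into multi-view 2D images on the input side and reading actions out of per-view 2D heatmaps on the output side, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the view set (top/front/right orthographic projections), workspace bounds, and all function names (project_to_views, predict_heatmaps, lift_to_3d) are assumptions for illustration, and the heatmap predictor is a placeholder where the fine-tuned VLM backbone would sit.

```python
# Minimal sketch of the input-output alignment idea, assuming orthographic
# top/front/right projections and a placeholder heatmap predictor.
# All names and the workspace bounds are hypothetical, not the BridgeVLA API.
import numpy as np

BOUNDS = ((-0.5, 0.5), (-0.5, 0.5), (0.0, 1.0))   # assumed workspace limits (x, y, z)
AXES = {"top": (0, 1), "front": (0, 2), "right": (1, 2)}  # coordinates kept per view
RES = 224

def project_to_views(points):
    """Render a point cloud (N, 3) into three orthographic occupancy images."""
    views = {}
    for name, (a, b) in AXES.items():
        img = np.zeros((RES, RES), dtype=np.float32)
        ua = (points[:, a] - BOUNDS[a][0]) / (BOUNDS[a][1] - BOUNDS[a][0])
        ub = (points[:, b] - BOUNDS[b][0]) / (BOUNDS[b][1] - BOUNDS[b][0])
        px = np.clip((ua * (RES - 1)).astype(int), 0, RES - 1)
        py = np.clip((ub * (RES - 1)).astype(int), 0, RES - 1)
        img[py, px] = 1.0
        views[name] = img
    return views

def predict_heatmaps(views):
    """Placeholder for the VLM backbone: one probability heatmap per view."""
    heatmaps = {}
    for name, img in views.items():
        logits = img  # a real model would condition on the language instruction
        flat = np.exp(logits - logits.max())
        heatmaps[name] = flat / flat.sum()
    return heatmaps

def lift_to_3d(heatmaps):
    """Average the per-view heatmap argmax locations into one 3D translation."""
    sums, counts = np.zeros(3), np.zeros(3)
    for name, (a, b) in AXES.items():
        py, px = np.unravel_index(np.argmax(heatmaps[name]), heatmaps[name].shape)
        sums[a] += BOUNDS[a][0] + px / (RES - 1) * (BOUNDS[a][1] - BOUNDS[a][0])
        sums[b] += BOUNDS[b][0] + py / (RES - 1) * (BOUNDS[b][1] - BOUNDS[b][0])
        counts[a] += 1
        counts[b] += 1
    return sums / counts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 1.0], size=(2048, 3))
    hm = predict_heatmaps(project_to_views(cloud))
    print("estimated translation:", lift_to_3d(hm))
```

Because each orthographic view constrains two of the three translation coordinates, averaging the per-view argmax estimates recovers a 3D position; in the paper's setting the same 2D heatmap interface is what the VLM is pre-trained to produce, which is what keeps pre-training and fine-tuning aligned.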
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 16541