VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

ICLR 2026 Conference Submission 1488 Authors

03 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: VLM, VLA, Empirical Study
TL;DR: We present a simple and effective framework to fairly benchmark different Vision-Language Models (VLMs) as backbones for robotic policies, revealing a notable performance gap that highlights a disconnect between the VLM and VLA domains.
Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do the choice and specific capabilities of the underlying VLM affect the performance of VLA policies? We introduce \textbf{VLM4VLA}, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Our pipeline, though simple, proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that a VLM's general capabilities are poor predictors of its downstream task performance, contrary to common assumptions. Inconsistencies across benchmarks suggest that VLA policies require capabilities beyond what current VLMs pursue. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Lastly, our analysis reveals that the vision encoder is a critical bottleneck, and that the ability to fine-tune it is crucial for strong performance. These results highlight a significant gap between current VLM pretraining paradigms and the specific demands of embodied tasks. We will release our code, models, and evaluation logs at \href{https://sites.google.com/view/vlm4vla}{our anonymous website} to encourage further research and foster a better understanding of this direction.
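The abstract describes attaching only a small set of new learnable parameters to a pretrained VLM to obtain a VLA policy. Below is a minimal, hypothetical sketch of what such an adaptation could look like: a lightweight action head trained on top of VLM hidden states, with the backbone optionally frozen. All names, interfaces, and design choices here are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): wrap a generic VLM backbone with a
# small learnable action head, so only the new parameters are trained when
# comparing different VLM backbones. The VLM interface below is assumed.
import torch
import torch.nn as nn


class ToyVLMBackbone(nn.Module):
    """Stand-in for a pretrained VLM that returns per-token hidden states."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder = nn.Linear(hidden_dim, hidden_dim)  # placeholder for the real VLM

    def forward(self, vision_language_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_dim) -> (batch, seq_len, hidden_dim)
        return self.encoder(vision_language_tokens)


class VLAPolicy(nn.Module):
    """Frozen (or fine-tunable) VLM backbone plus a small learnable action head."""

    def __init__(self, vlm: nn.Module, hidden_dim: int, action_dim: int = 7,
                 finetune_vlm: bool = False):
        super().__init__()
        self.vlm = vlm
        if not finetune_vlm:
            for p in self.vlm.parameters():
                p.requires_grad_(False)
        # The only new learnable parameters: a lightweight MLP action head.
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, vision_language_tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.vlm(vision_language_tokens)   # (B, T, H)
        pooled = hidden[:, -1]                      # use the last token as the policy state
        return self.action_head(pooled)             # (B, action_dim), e.g. end-effector deltas


if __name__ == "__main__":
    vlm = ToyVLMBackbone(hidden_dim=768)
    policy = VLAPolicy(vlm, hidden_dim=768, action_dim=7, finetune_vlm=False)
    tokens = torch.randn(2, 16, 768)                # fake fused vision-language tokens
    actions = policy(tokens)
    print(actions.shape)                            # torch.Size([2, 7])
```

Whether the backbone is frozen or fine-tuned (the `finetune_vlm` flag above) matters because the abstract reports that the ability to fine-tune the vision encoder is crucial for strong performance.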
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 1488