Scalable Vision-Language-Action Models for General-Purpose Robotics

Xiaonan Song

Published: 24 May 2026, Last Modified: 24 May 2026

Venue: ScaleBot @ CVPR 2026

Readers: Everyone

License: CC BY 4.0

Keywords: Scalable Robotic Models, VLA Model, Pretraining Datasets, Multimodal Datasets, Few-Shot Task Completion, Hyperparameter Optimization, Computational Efficiency, Task-Specific Performance, Vision-Language Integration

TL;DR: Scaling up vision-language-action models improves robot performance, but efficiency matters: our model generalizes well across tasks yet trades computational cost against accuracy, and the two must be balanced carefully for real-world use.

Abstract: This study introduces a Vision-Language-Action (VLA) model designed to address the challenges of general-purpose robotics, with a specific emphasis on scalability and generalization across a diverse range of tasks. The core methodology involves pretraining the VLA model on extensive multimodal datasets, enabling it to learn rich representations of the environment and task-relevant information. The evaluation includes ablation studies that assess the contribution of individual model components, along with hyperparameter tuning to optimize performance. Preliminary results demonstrate the model's potential as a foundational architecture for developing more versatile and adaptable robotic systems. Further investigation is warranted to explore its limitations and potential for real-world deployment.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 25
