RoboOmni: Actions Are Just Another Modality for Your Vision-Language Models

ICLR 2026 Conference Submission 16451 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Everyone · CC BY 4.0
Keywords: Vision Language Action Model, Multi-Modal Learning, Manipulation
Abstract: Integrating Vision-Language Models (VLMs) into robotics has enabled building generalizable Vision-Language-Action (VLA) models for robotic manipulation. While decoupled designs with a separate action expert often outperform unified frameworks, the latter (e.g., OpenVLA) offer an appealing, conceptually integrated architecture. Nevertheless, current unified approaches typically suffer from poor historical context integration and distribution shift because they cannot predict action chunks. We introduce **RoboOmni**, a unified multi-modal next-token prediction framework for robotic manipulation designed to overcome these issues. Compared with decoupled approaches, **RoboOmni** unifies the multi-modal representations and minimizes the distribution gap between vision-language pretraining and action finetuning. Moreover, in contrast to prior unified approaches, **RoboOmni** introduces an action chunking mechanism, *Multi-Token Action Prediction* (MTAP), which supports both FAST and Bin tokenizers and crucially alleviates the action distribution shift that arises when training on noisy real-world data. By preserving the original VLM training pipeline, **RoboOmni** naturally supports co-training with multi-modal data and various VLM optimization techniques, *e.g.,* fast inference optimization, which significantly improves its generalization capabilities and extensibility. We conduct extensive experiments on both the CALVIN benchmark and a real-world robot, demonstrating state-of-the-art (SOTA) performance. Our MTAP implementation with the FAST tokenizer achieves a 94.4% average success rate on CALVIN. Furthermore, our Bin tokenizer implementation, deployed with existing VLM serving frameworks such as SGLang, achieves a 27x inference speedup compared with OpenVLA.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 16451
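
To make the multi-token action prediction idea from the abstract concrete, here is a minimal, self-contained sketch of autoregressively decoding a chunk of action tokens and detokenizing them with a uniform Bin tokenizer. All names (`BinActionTokenizer`, `predict_next_token`), the bin count, chunk length, and action dimensionality are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multi-token action prediction (MTAP) with a uniform
# bin tokenizer. The VLM's next-token head is replaced by a random stub so the
# example runs standalone; only numpy is required.

import numpy as np


class BinActionTokenizer:
    """Discretizes each continuous action dimension into uniform bins (assumed scheme)."""

    def __init__(self, num_bins: int = 256, low: float = -1.0, high: float = 1.0):
        self.num_bins = num_bins
        self.low, self.high = low, high

    def encode(self, actions: np.ndarray) -> np.ndarray:
        """Map continuous actions in [low, high] to integer token ids."""
        scaled = (actions - self.low) / (self.high - self.low)
        return np.clip((scaled * self.num_bins).astype(int), 0, self.num_bins - 1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        """Map token ids back to bin-center continuous actions."""
        centers = (tokens + 0.5) / self.num_bins
        return centers * (self.high - self.low) + self.low


def predict_next_token(context: list[int]) -> int:
    """Stand-in for the VLM's next-token prediction; returns a random token id."""
    rng = np.random.default_rng(len(context))
    return int(rng.integers(0, 256))


def predict_action_chunk(prompt_tokens: list[int], chunk_len: int = 8, action_dim: int = 7) -> np.ndarray:
    """Autoregressively decode chunk_len * action_dim action tokens, then detokenize."""
    tokenizer = BinActionTokenizer()
    context = list(prompt_tokens)
    action_tokens = []
    for _ in range(chunk_len * action_dim):
        tok = predict_next_token(context)
        context.append(tok)
        action_tokens.append(tok)
    tokens = np.array(action_tokens).reshape(chunk_len, action_dim)
    return tokenizer.decode(tokens)  # (chunk_len, action_dim) continuous actions


if __name__ == "__main__":
    chunk = predict_action_chunk(prompt_tokens=[1, 2, 3])
    print(chunk.shape)  # (8, 7): a chunk of 8 future actions, 7-DoF each
```

The point of the sketch is only that decoding `chunk_len * action_dim` tokens in one autoregressive pass lets a unified next-token model emit a short action chunk rather than a single step, which is the property the abstract credits with mitigating distribution shift; the actual tokenization and decoding details in RoboOmni may differ.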