Keywords: Vision-language-action model, Human-computer interaction, Duplex framework
TL;DR: We introduce VITA-E, a dual-model VLA interaction framework that supports fluent voice interaction and motion control, as well as interruptible human-machine interaction, enabling natural communication with the user.
Abstract: Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, limiting their ability to handle real-time user interruptions or perform concurrent tasks such as speaking while acting. This hinders seamless human-robot collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel dual-model framework designed to enable flexible and robust real-time human-robot interaction. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an **Active Model** and a **Listening Model**, allowing one to instantly intervene in the other. We further propose a "model-as-controller" paradigm, in which we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid robot demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving a 100% success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants.
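The "model-as-controller" paradigm can be sketched as a token-dispatch loop: special tokens in the VLM's output stream are intercepted and translated into system-level commands rather than rendered as text. This is a minimal illustrative sketch; the token names (`<|stop|>`, `<|interrupt|>`, `<|speak|>`) and the dispatcher are assumptions for exposition, not the paper's actual vocabulary or API.

```python
# Hypothetical sketch of "model-as-controller" dispatch: special tokens
# emitted by the fine-tuned VLM are intercepted and mapped to system-level
# commands, while ordinary tokens pass through as speech/text output.
# All token names below are illustrative assumptions.

SPECIAL_TOKENS = {
    "<|stop|>": "EMERGENCY_STOP",    # halt the Active Model's motion
    "<|interrupt|>": "TAKE_OVER",    # Listening Model intervenes
    "<|speak|>": "START_SPEECH",     # begin concurrent speech output
}

def dispatch(token_stream):
    """Split a generated token stream into system commands and plain output."""
    commands, words = [], []
    for tok in token_stream:
        if tok in SPECIAL_TOKENS:
            commands.append(SPECIAL_TOKENS[tok])  # route to the controller
        else:
            words.append(tok)                     # route to speech/display
    return commands, " ".join(words)

commands, text = dispatch(["Picking", "up", "the", "cup", "<|stop|>"])
```

In this sketch, generating `<|stop|>` mid-utterance yields the command `EMERGENCY_STOP` alongside the spoken text, coupling the model's reasoning directly to system behavior.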
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3407