Keywords: Vision-language-action model, Human-computer interaction, Duplex framework
TL;DR: We introduce VITA-E, a dual-model VLA interaction framework that supports fluent voice interaction and motion control, as well as interruptible human-machine interaction, enabling natural communication with the user.
Abstract: Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, limiting their ability to handle real-time user interruptions or perform concurrent tasks such as speaking while acting. This hinders seamless human-robot collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel dual-model framework designed to enable flexible and robust real-time human-robot interaction. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an **Active Model** and a **Listening Model**, allowing one to instantly intervene in the other. We further propose a "model-as-controller" paradigm, in which we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid robot demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving a 100% success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants.
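The "model-as-controller" paradigm can be sketched as a token-dispatch loop: special tokens in the VLM's output stream are intercepted and translated into system-level commands rather than rendered as text. This is a minimal illustrative sketch; the token names (`<|stop|>`, `<|interrupt|>`, `<|speak|>`) and the dispatcher are assumptions for exposition, not the paper's actual vocabulary or API.

```python
# Hypothetical sketch of "model-as-controller" dispatch: special tokens
# emitted by the fine-tuned VLM are intercepted and mapped to system-level
# commands, while ordinary tokens pass through as speech/text output.
# All token names below are illustrative assumptions.

SPECIAL_TOKENS = {
    "<|stop|>": "EMERGENCY_STOP",    # halt the Active Model's motion
    "<|interrupt|>": "TAKE_OVER",    # Listening Model intervenes
    "<|speak|>": "START_SPEECH",     # begin concurrent speech output
}

def dispatch(token_stream):
    """Split a generated token stream into system commands and plain output."""
    commands, words = [], []
    for tok in token_stream:
        if tok in SPECIAL_TOKENS:
            commands.append(SPECIAL_TOKENS[tok])  # route to the controller
        else:
            words.append(tok)                     # route to speech/display
    return commands, " ".join(words)

commands, text = dispatch(["Picking", "up", "the", "cup", "<|stop|>"])
```

In this sketch, generating `<|stop|>` mid-utterance yields the command `EMERGENCY_STOP` alongside the spoken text, coupling the model's reasoning directly to system behavior.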
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3407