Abstract: Current mainstream pre-trained large models, such as LLMs represented by ChatGPT and VLA models represented by OpenVLA, have achieved significant progress in multimodal tasks through a "Multiple-Input, Single-Output" (MISO) architecture. However, our investigation reveals that the MISO architecture exhibits fundamental limitations in "Multiple-Input, Multiple-Output" (MIMO) settings (e.g., parallel multi-task output processing): the architecture induces task mutual-exclusion effects, causing resource contention among tasks that share the output channel and consequently leading to optimization imbalance and performance degradation. In contrast, human MIMO processing inherently supports concurrent task execution (e.g., simultaneous dialogue and decision-making) without interference. Inspired by this, we propose a unified MIMO training model with parallel multi-task output capabilities: the Visual Language Action Model for Simultaneously Chatting and Decision Making (VLASCD). We evaluate the model on the CARLA autonomous driving platform. The results show that, compared to LLMs with MISO dialogue capabilities, reinforcement learning models, and VLA models with MISO decision-making capabilities, VLASCD significantly outperforms existing MISO models in simultaneously handling dialogue generation and decision-making within the MIMO scenario.
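To make the MISO/MIMO distinction concrete, below is a minimal sketch of a MIMO-style model: a shared backbone that emits dialogue logits and control actions in parallel from one forward pass. This is an illustration only; the class and head names (MimoSketch, dialogue_head, action_head) and all dimensions are hypothetical and do not reflect VLASCD's actual implementation.

```python
# Minimal MIMO sketch: one shared encoder, two parallel output heads
# (dialogue tokens and driving actions) produced in a single forward pass.
# All names and dimensions are hypothetical, for illustration only.
import torch
import torch.nn as nn

class MimoSketch(nn.Module):
    def __init__(self, obs_dim=512, hidden_dim=256, vocab_size=32000, action_dim=3):
        super().__init__()
        # Shared backbone (stand-in for a fused vision/language encoder).
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # Parallel output channels: dialogue logits and continuous actions.
        self.dialogue_head = nn.Linear(hidden_dim, vocab_size)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, fused_obs):
        h = self.encoder(fused_obs)
        # Both outputs are emitted concurrently, unlike a MISO model that
        # routes every task through a single shared output channel.
        return self.dialogue_head(h), self.action_head(h)

model = MimoSketch()
text_logits, actions = model(torch.randn(1, 512))
print(text_logits.shape, actions.shape)  # torch.Size([1, 32000]) torch.Size([1, 3])
```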
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: large language model, vision language action model, multimodality, robotics
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 547