SVLA: A Unified Speech-Vision-Language Model for Multimodal Reasoning and Generation

ACL ARR 2025 May Submission6875 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Large Vision-Language Models have shown impressive capabilities in tasks such as image captioning, visual question answering, and cross-modal retrieval. However, significant challenges remain before the full potential of these models can be realized. First, integrating speech, text, and vision into a unified model is particularly difficult for tasks like Spoken Image Captioning and Spoken Visual Question Answering, where the interaction between these modalities introduces additional complexity. Second, existing speech generation approaches differ: some generate speech directly, while others rely on an intermediate text step; the impact of this choice on fluency, coherence, and accuracy remains unexplored. To address these challenges, we propose SVLA, a unified Speech-Vision-Language Assistant based on a decoder-only transformer architecture that seamlessly integrates multimodal inputs and outputs. We enhance model performance with a large-scale speech-text-image dataset containing 38.2 million examples and 64.1 hours of TTS-generated speech. Our approach advances multimodal understanding and generation, facilitating more effective integration of speech, text, and vision (http://github.com/vlm-svla/svla).
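To illustrate the general idea of a decoder-only transformer operating over speech, image, and text in a single token stream, the sketch below builds one shared vocabulary from separate per-modality token ranges and runs a causal transformer over the interleaved sequence. This is only a minimal illustration under assumed vocabulary sizes, special tokens, and toy inputs; it is not the SVLA implementation, and it omits details such as positional encodings, modality-specific tokenizers, and speech synthesis.

```python
# Minimal sketch: one causal decoder over interleaved image, speech, and text tokens.
# All sizes, special tokens, and helper names are hypothetical placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_VOCAB, IMAGE_VOCAB = 32_000, 1_024, 8_192   # assumed sizes
SPECIALS = {"<boi>": 0, "<eoi>": 1, "<bos_speech>": 2, "<eos_speech>": 3}


class UnifiedDecoder(nn.Module):
    """A single causal transformer over a shared multimodal token space."""

    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        vocab = len(SPECIALS) + TEXT_VOCAB + SPEECH_VOCAB + IMAGE_VOCAB
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq_len) int64; causal mask enforces left-to-right decoding.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(hidden)  # next-token logits over the shared vocabulary


def build_sequence(text_ids, speech_ids, image_ids):
    """Shift each modality into a disjoint id range and interleave the result."""
    off_text = len(SPECIALS)
    off_speech = off_text + TEXT_VOCAB
    off_image = off_speech + SPEECH_VOCAB
    seq = (
        [SPECIALS["<boi>"]] + [i + off_image for i in image_ids] + [SPECIALS["<eoi>"]]
        + [SPECIALS["<bos_speech>"]] + [s + off_speech for s in speech_ids] + [SPECIALS["<eos_speech>"]]
        + [t + off_text for t in text_ids]
    )
    return torch.tensor(seq, dtype=torch.long).unsqueeze(0)


# Toy example: 4 image patch tokens, 3 speech unit tokens, 2 text tokens.
tokens = build_sequence(text_ids=[5, 17], speech_ids=[3, 9, 9], image_ids=[1, 2, 3, 4])
logits = UnifiedDecoder()(tokens)
print(logits.shape)  # (1, sequence_length, shared_vocab_size)
```

In this framing, generating speech directly versus via an intermediate text step simply changes which token range the model is asked to emit first at inference time; the comparison of those two strategies is part of what the paper studies.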
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: speech and vision; QA via spoken queries; multimodality
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 6875