Keywords: audio-visual speech, multimodal reasoning, MLLM
TL;DR: Dataset and baseline for a new task: audio-visual speech understanding
Abstract: Audio-visual speech processing leverages visual cues (e.g., lip movements) to improve the robustness of speech processing in noisy environments. However, current research is heavily focused on Audio-Visual Speech Recognition (AVSR), which addresses the surface-level task of transcription and overlooks the need for deeper semantic understanding under challenging auditory conditions. To bridge this gap, we introduce \textbf{Audio-Visual Speech Understanding (AVSU)}, a new task that aims to comprehend the semantics and context of spoken content, going beyond mere transcription.
To support AVSU, we build \textbf{AVSU-Bench}, a large-scale dataset with 50k question-answer pairs aligned with audio-visual speech videos.
We further propose \textbf{VSpeech-R1}, the first end-to-end multimodal large language model tailored for AVSU. A key component of the model is VSpeech-CoT, a structured Chain-of-Thought reasoning framework enabled by a training strategy that combines a supervised cold start with reinforcement learning.
Extensive evaluations on AVSU-Bench demonstrate that our end-to-end framework consistently outperforms traditional cascaded pipelines. Specifically, VSpeech-R1 achieves a BERTScore of 92.43\%, an absolute improvement of 2.33 percentage points over the best cascaded baseline.
Primary Area: datasets and benchmarks
Submission Number: 852