Sound in Sights: Deriving Visual Insights from Audio for Comprehensive Video Understanding with Large Multimodal Models

ICLR 2026 Conference Submission 13380 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, Video Understanding
TL;DR: We propose SoundInSight, a large-scale audio-visual question answering dataset, and a baseline method for video understanding.
Abstract: Video understanding is inherently multimodal, requiring both visual and auditory cues to form a complete representation of dynamic scenes. However, most existing video understanding models rely solely on visual content, overlooking informative audio cues, such as spoken instructions or environmental sounds, for scene understanding and event comprehension. Progress in audio-visual reasoning has been hindered by the lack of high-quality supervised fine-tuning (SFT) data that jointly considers video and audio. To address this gap, we introduce SoundInSight, a large-scale audio-visual question answering dataset comprising over 80k question–answer pairs from online videos, created via a multimodal large language model (MLLM)-assisted annotation pipeline. SoundInSight provides rich supervision for audio-visual reasoning, enabling MLLMs to be fine-tuned for audio understanding. We find that current video MLLMs rely heavily on visual information, which hinders effective multimodal learning. To mitigate this, we propose an audio-only pretraining stage that significantly improves audio-visual reasoning performance. Additionally, to evaluate audio-visual comprehension, we construct a high-quality, manually curated test set of 1,000 samples that require joint audio-visual understanding and exceed standard benchmarks in complexity. Models fine-tuned on SoundInSight with the proposed training strategy achieve substantial performance gains on this new benchmark. Moreover, on the challenging VideoMME evaluation, our approach significantly improves performance in the Information Synopsis subcategory, demonstrating the efficacy of incorporating audio. The SoundInSight dataset and code will be publicly released to facilitate further research.
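To make the two-stage training strategy concrete, below is a minimal sketch: Stage 1 withholds the video stream so the model must learn from audio alone, and Stage 2 fine-tunes on the full audio-visual QA pairs. All names here (AVSample, train_step, two_stage_finetune) are hypothetical placeholders for illustration, not the authors' released code.

```python
# Minimal sketch of the two-stage recipe described in the abstract.
# AVSample, train_step, and two_stage_finetune are hypothetical placeholders.
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass
class AVSample:
    question: str
    answer: str
    audio_path: str
    video_path: Optional[str]  # None means the visual stream is withheld

def train_step(model, batch: List[AVSample]) -> float:
    """Stand-in for one SFT optimizer step; returns a dummy loss."""
    return 0.0

def two_stage_finetune(model, data: List[AVSample],
                       stage1_epochs: int = 1, stage2_epochs: int = 2) -> None:
    # Stage 1: audio-only pretraining -- mask out video so the model cannot
    # fall back on its dominant visual pathway.
    audio_only = [replace(s, video_path=None) for s in data]
    for _ in range(stage1_epochs):
        train_step(model, audio_only)
    # Stage 2: joint audio-visual SFT on the full QA samples.
    for _ in range(stage2_epochs):
        train_step(model, data)

if __name__ == "__main__":
    sample = AVSample("What instrument is playing?", "A violin.",
                      "clip1.wav", "clip1.mp4")
    two_stage_finetune(model=None, data=[sample])
```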
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 13380