Sound in Sights: Deriving Visual Insights from Audio for Comprehensive Video Understanding with Large Multimodal Models

ICLR 2026 Conference Submission 13380 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, Video Understanding
TL;DR: We propose SoundInSight, a large-scale audio-visual question answering dataset, and a baseline method for video understanding.
Abstract: Video understanding is inherently multimodal, requiring both visual and auditory cues to form a complete representation of dynamic scenes. However, most existing video understanding models rely solely on visual content, overlooking informative audio cues, such as spoken instructions or environmental sounds, for scene understanding and event comprehension. Progress in audio-visual reasoning has been hindered by the lack of high-quality supervised fine-tuning (SFT) data that jointly considers video and audio. To address this gap, we introduce SoundInSight, a large-scale audio-visual question answering dataset comprising over 80k question–answer pairs from online videos, created via a multimodal large language model (MLLM)-assisted annotation pipeline. SoundInSight provides rich supervision for audio-visual reasoning, enabling MLLMs to be fine-tuned for audio understanding. We find that current video MLLMs rely heavily on visual information, which hinders effective multimodal learning. To mitigate this, we propose an audio-only pretraining stage that significantly improves audio-visual reasoning performance. Additionally, to evaluate audio-visual comprehension, we construct a high-quality, manually curated test set of 1,000 samples that require joint audio-visual understanding and exceed standard benchmarks in complexity. Models fine-tuned on SoundInSight with the proposed training strategy achieve substantial performance gains on this new benchmark. Moreover, on the challenging VideoMME evaluation, our approach significantly improves performance in the Information Synopsis subcategory, demonstrating the efficacy of incorporating audio. The SoundInSight dataset and code will be publicly released to facilitate further research.
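To make the two-stage training strategy concrete, below is a minimal sketch: Stage 1 withholds the video stream so the model must learn from audio alone, and Stage 2 fine-tunes on the full audio-visual QA pairs. All names here (AVSample, train_step, two_stage_finetune) are hypothetical placeholders for illustration, not the authors' released code.

```python
# Minimal sketch of the two-stage recipe described in the abstract.
# AVSample, train_step, and two_stage_finetune are hypothetical placeholders.
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass
class AVSample:
    question: str
    answer: str
    audio_path: str
    video_path: Optional[str]  # None means the visual stream is withheld

def train_step(model, batch: List[AVSample]) -> float:
    """Stand-in for one SFT optimizer step; returns a dummy loss."""
    return 0.0

def two_stage_finetune(model, data: List[AVSample],
                       stage1_epochs: int = 1, stage2_epochs: int = 2) -> None:
    # Stage 1: audio-only pretraining -- mask out video so the model cannot
    # fall back on its dominant visual pathway.
    audio_only = [replace(s, video_path=None) for s in data]
    for _ in range(stage1_epochs):
        train_step(model, audio_only)
    # Stage 2: joint audio-visual SFT on the full QA samples.
    for _ in range(stage2_epochs):
        train_step(model, data)

if __name__ == "__main__":
    sample = AVSample("What instrument is playing?", "A violin.",
                      "clip1.wav", "clip1.mp4")
    two_stage_finetune(model=None, data=[sample])
```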
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 13380