Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
Keywords: multimodal, representation alignment, neuroscience, human behavior, sound localization, auditory processing
TL;DR: We present a unified audio-visual framework to uncover how humans and AI respond to modality conflicts and bias.
Abstract: Imagine hearing a dog bark and instinctively turning toward the sound—only to find a parked car, while a silent dog sits nearby. Such moments of sensory conflict challenge perception, yet humans flexibly resolve these discrepancies, prioritizing auditory cues over misleading visuals to accurately localize sounds. Despite the rapid advancement of multimodal AI models that integrate vision and sound, little is known about how these systems handle cross-modal conflicts or whether they favor one modality over another. Here, we systematically and quantitatively examine modality bias and conflict resolution in AI models for Sound Source Localization (SSL). We evaluate a wide range of state-of-the-art multimodal models and compare them against human performance in psychophysics experiments spanning six audiovisual conditions, including congruent, conflicting, and absent visual and audio cues. Our results reveal that humans consistently outperform AI in SSL and exhibit greater robustness to conflicting or absent visual information by effectively prioritizing auditory signals. In contrast, AI shows a pronounced bias toward vision, often failing to suppress irrelevant or conflicting visual input, leading to chance-level performance. To bridge this gap, we present EchoPin, a neuroscience-inspired multimodal model for SSL that emulates human auditory perception. The model is trained on our carefully curated AudioCOCO dataset, in which stereo audio signals are first rendered using a physics-based 3D simulator, then filtered with Head-Related Transfer Functions (HRTFs) to capture pinnae, head, and torso effects, and finally transformed into cochleagram representations that mimic cochlear processing. To eliminate existing biases in standard benchmark datasets, we carefully controlled the vocal object sizes, semantics, and spatial locations in the corresponding images of AudioCOCO. EchoPin outperforms existing models trained on standard audio-visual datasets. Remarkably, consistent with neuroscience findings, it exhibits a human-like localization bias, favoring horizontal (left–right) precision over vertical (up–down) precision.
This asymmetry likely arises from HRTF-shaped and cochlear-modulated stereo audio and the lateral placement of human ears, highlighting how sensory input quality and physical structure jointly shape the precision of multimodal representations. All code, data, and models are available \href{https://github.com/CuriseJia/SSHS}{here}.
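To make the audio pipeline described above concrete, here is a minimal sketch (not the authors' code, and not the AudioCOCO tooling) of how a mono source can be spatialized with a pair of head-related impulse responses (the time-domain counterpart of HRTFs) and then converted into a crude cochleagram. It uses only NumPy and SciPy; the sample rate, filterbank settings, helper names, and the synthetic HRIRs are illustrative assumptions.

```python
# Sketch of the abstract's audio preprocessing: HRIR spatialization -> cochleagram.
# All parameter values and helper names are illustrative, not the paper's settings.
import numpy as np
from scipy.signal import fftconvolve, butter, sosfilt

SR = 44100  # sample rate in Hz (assumed)

def spatialize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Render binaural stereo by convolving the source with direction-specific
    left/right head-related impulse responses (pinna, head, torso effects)."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=0)          # shape: (2, n_samples)

def erb_space(f_lo: float, f_hi: float, n: int) -> np.ndarray:
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb(f_lo), erb(f_hi), n))

def cochleagram(stereo: np.ndarray, n_bands: int = 40,
                f_lo: float = 50.0, f_hi: float = 16000.0,
                hop: int = 441) -> np.ndarray:
    """Crude cochleagram: ERB-spaced bandpass filterbank -> half-wave
    rectification -> power-law compression -> temporal downsampling.
    Returns an array of shape (2, n_bands, n_frames)."""
    cfs = erb_space(f_lo, f_hi, n_bands)
    channels = []
    for ch in stereo:
        bands = []
        for cf in cfs:
            lo, hi = cf / 1.2, min(cf * 1.2, SR / 2 - 1)
            sos = butter(2, [lo, hi], btype="band", fs=SR, output="sos")
            env = np.maximum(sosfilt(sos, ch), 0.0) ** 0.3  # rectify + compress
            bands.append(env[::hop])                        # downsample in time
        channels.append(np.stack(bands))
    return np.stack(channels)

# Usage with synthetic stand-ins (real HRIRs would come from a measured HRTF set):
mono = np.random.randn(SR)                       # 1 s of noise as a placeholder source
hrir_l = np.random.randn(256) * np.hanning(256)  # fake left-ear impulse response
hrir_r = np.random.randn(256) * np.hanning(256)  # fake right-ear impulse response
coch = cochleagram(spatialize(mono, hrir_l, hrir_r))
print(coch.shape)  # (2, 40, ~101): stereo x frequency bands x time frames
```

In the paper's actual pipeline, the stereo signals come from a physics-based 3D simulator and measured HRTFs rather than the synthetic stand-ins used here; the sketch only illustrates the order of operations (spatialization, then cochlear-style filtering) that gives the model its interaural cues.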
Primary Area: Neuroscience and cognitive science (e.g., neural coding, brain-computer interfaces)
Submission Number: 860