Keywords: Multimodality, self-supervised learning, audio-vision
Abstract: One of the underlying assumptions behind audio-visual learning models is that the two modalities convey overlapping information. However, this assumption is frequently violated in practice, which results in degraded performance. To address this problem, we propose to replace mismatched audio-visual signals using cross-modal generative models. Our approach uses language-based supervision to perform this generation. We show that data synthetically generated through this process is well-suited for a variety of representation learning methods. The features learned this way outperform those learned solely from real data on a range of downstream tasks, including audio classification, audio-visual retrieval, and visual sound localization.
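To make the idea in the abstract concrete, here is a minimal sketch of one possible instantiation: score each clip's audio-visual correspondence, and when the score falls below a threshold, regenerate the audio from a language description of the visuals. This is not the authors' actual pipeline; the function bodies, model choices, and threshold are hypothetical placeholders standing in for real captioning, text-to-audio, and correspondence models.

```python
from dataclasses import dataclass
from typing import List
import random


@dataclass
class Clip:
    video: object          # placeholder for video frames
    audio: object          # placeholder for a waveform
    caption: str = ""      # language description, filled in when needed


def av_correspondence(clip: Clip) -> float:
    """Score how well the audio matches the visuals (stub; a real system
    would use a pretrained audio-visual correspondence model)."""
    return random.random()


def caption_video(clip: Clip) -> str:
    """Describe the visual content in language (stub for a video captioner)."""
    return "a dog barking in a park"


def generate_audio_from_text(caption: str) -> object:
    """Synthesize audio matching a caption (stub for a text-to-audio model)."""
    return f"<synthetic audio for: {caption}>"


def replace_mismatched_audio(clips: List[Clip], threshold: float = 0.5) -> List[Clip]:
    """Keep well-matched pairs; regenerate the audio for mismatched ones."""
    curated = []
    for clip in clips:
        if av_correspondence(clip) < threshold:
            clip.caption = caption_video(clip)
            clip.audio = generate_audio_from_text(clip.caption)
        curated.append(clip)
    return curated


if __name__ == "__main__":
    dataset = [Clip(video="frames_0", audio="waveform_0"),
               Clip(video="frames_1", audio="waveform_1")]
    for clip in replace_mismatched_audio(dataset):
        print(clip.audio)
```

The curated clips would then feed an audio-visual representation learner in place of the raw, possibly mismatched pairs.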
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6273