Object-Aware Audio-Visual Sound Generation

27 Sept 2024 (modified: 15 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Sound Generation, Audio-Visual Learning, Multimodal Learning
TL;DR: We generate object-specific sounds in complex visual scenes using user-provided segmentation masks through self-supervision.
Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially when multiple objects and sound sources are present. In this paper, we introduce an object-aware sound generation model that aligns generated sounds with visual objects in a scene. By grounding sound generation in object-centric representations, our model learns to associate specific visual objects with their corresponding sounds. We fine-tune a conditional latent diffusion model with dot-product attention to improve sound-object alignment. At test time, users can compositionally generate sounds by selecting objects via segmentation masks. We theoretically validate our test-time object-grounding ability, ensuring that even subtle sounds can be represented. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10176
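
The abstract mentions dot-product attention for sound-object alignment and test-time object selection via segmentation masks, but gives no architectural details. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes (hypothetically) that audio latent tokens act as queries over visual patch features, and that a user mask restricts attention to the selected patches. The names `object_attention`, `audio_queries`, `patch_features`, and `object_mask` are illustrative and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def object_attention(audio_queries, patch_features, object_mask=None):
    """Scaled dot-product attention from audio tokens to visual patches.

    audio_queries:  (T, d) array, hypothetical audio latent tokens.
    patch_features: (P, d) array, hypothetical visual patch embeddings.
    object_mask:    optional (P,) boolean array; True marks patches that
                    belong to the user-selected object (assumed to be a
                    segmentation mask downsampled to the patch grid).
    Returns (context, weights) with shapes (T, d) and (T, P).
    """
    audio_queries = np.asarray(audio_queries, dtype=float)
    patch_features = np.asarray(patch_features, dtype=float)
    d = audio_queries.shape[-1]

    # Similarity between every audio token and every visual patch.
    scores = audio_queries @ patch_features.T / np.sqrt(d)

    if object_mask is not None:
        object_mask = np.asarray(object_mask, dtype=bool)
        if not object_mask.any():
            raise ValueError("object_mask must select at least one patch")
        # Patches outside the mask get -inf, so their weight is exactly 0.
        scores = np.where(object_mask[None, :], scores, -np.inf)

    weights = softmax(scores, axis=-1)
    context = weights @ patch_features  # object-conditioned visual context
    return context, weights
```

Calling `object_attention(q, p)` without a mask gives ordinary soft attention over all patches. Passing a mask confines the conditioning signal to the chosen object, which is one plausible way to read "selecting objects via segmentation masks".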