Sounding that Object: Interactive Object-Aware Image to Audio Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We interactively generate sounds specific to user-selected objects within complex visual scenes.
Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.
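The abstract describes conditioning a latent diffusion model on image-region features via multi-modal attention, with user-provided segmentation masks standing in for the learned attention at test time. The following is a minimal, hypothetical sketch of that idea (not the authors' implementation; all module and argument names such as `ObjectAwareCrossAttention` and `object_mask` are assumptions for illustration): audio latent tokens attend over image patch features, and a boolean segmentation mask, when supplied, restricts attention to the selected object's patches.

```python
# Hypothetical sketch of object-aware cross-attention for a conditional
# latent diffusion audio generator. Not the paper's code; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectAwareCrossAttention(nn.Module):
    def __init__(self, audio_dim: int, image_dim: int, attn_dim: int = 256):
        super().__init__()
        self.q = nn.Linear(audio_dim, attn_dim)   # queries from audio latents
        self.k = nn.Linear(image_dim, attn_dim)   # keys from image patch features
        self.v = nn.Linear(image_dim, audio_dim)  # values projected back to audio dim
        self.scale = attn_dim ** -0.5

    def forward(self, audio_tokens, image_patches, object_mask=None):
        # audio_tokens:  (B, T, audio_dim)  latent audio tokens being denoised
        # image_patches: (B, N, image_dim)  visual features, one per image patch
        # object_mask:   (B, N) boolean, True for patches of the user-selected object
        #                (None during training; provided by segmentation at test time)
        scores = torch.einsum(
            "btd,bnd->btn", self.q(audio_tokens), self.k(image_patches)
        ) * self.scale
        if object_mask is not None:
            # Test-time behavior: suppress attention outside the selected object,
            # approximating the learned attention with a hard segmentation mask.
            scores = scores.masked_fill(~object_mask[:, None, :], float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return audio_tokens + torch.einsum("btn,bnd->btd", attn, self.v(image_patches))


# Usage sketch: condition one denoising step on a user-selected region.
if __name__ == "__main__":
    B, T, N = 1, 128, 196
    layer = ObjectAwareCrossAttention(audio_dim=64, image_dim=512)
    audio = torch.randn(B, T, 64)              # noisy audio latents
    patches = torch.randn(B, N, 512)           # image patch features
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[:, :49] = True                        # patches covering the clicked object
    out = layer(audio, patches, object_mask=mask)
    print(out.shape)  # torch.Size([1, 128, 64])
```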
Lay Summary: Real-world scenes often include multiple objects that each make their own sounds, such as cars, footsteps, and chatter, yet current AI systems cannot isolate these sounds from still images. We present an interactive model that links sounds to objects a user selects in a picture: segmentation masks let the user click on a visual object and generate its specific sound. We build on a latent diffusion framework and integrate object-centric learning so the system learns which image regions correspond to which sounds. At test time, segmentation masks guide generation, ensuring engine noises come from cars and crowd hum from people. Our theoretical analysis shows that replacing attention with segmentation masks yields equivalent grounding. We evaluate our model with objective measures and human studies: it outperforms existing methods in accuracy, audio quality, and user satisfaction. Users can mix sounds from multiple objects to create a coherent soundscape, and the model captures interactions, such as a stick splashing in water rather than generic water sounds. This work opens the door to intuitive audio-visual tools for filmmakers, virtual reality, and accessible media, making it easy to bring images to life with sound.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Tinglok/avobject
Primary Area: Applications
Keywords: Sound Generation, Audio-Visual Learning, Multi-Modal Learning
Submission Number: 14971