Keywords: Audio-visual segmentation, audio-visual semantic segmentation, image segmentation
Abstract: The audio-visual segmentation task aims to segment the sounding objects in visual data that correspond to the accompanying audio. Unlike conventional supervised approaches, this paper presents a method that does not require ground-truth audio-visual masks during training. The proposed framework consists of three decoupled stages: (1) segmenting category-agnostic and audio-agnostic objects from the input image alone, (2) associating the input audio with the segmented object masks to obtain the mask corresponding to the audio, and (3) classifying the object mask. We leverage pretrained segmentation and vision-language foundation models in the segmentation and classification stages, respectively. The audio-mask association module in the second stage is trained via a multiple-instance contrastive learning scheme, without relying on ground-truth correspondence between audio and object masks. For the association module, we propose an object mask representation that incorporates both local and global information of object masks, and a training framework that enhances segmentation performance on multi-source audio inputs. Our approach significantly outperforms previous unsupervised and weakly-supervised audio-visual source localization and segmentation methods. Furthermore, it achieves performance comparable to the supervised audio-visual semantic segmentation baseline.
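To make the second stage concrete, the following is a minimal sketch of what a multiple-instance contrastive objective for audio-mask association could look like. It is not the submission's implementation. The abstract does not specify the pooling rule, the similarity function, or the temperature, so max-pooling over the masks of an image, cosine similarity, and `temperature=0.1` are all assumptions, and the function names are hypothetical. Each image is treated as a bag of mask embeddings, and only the image-level pairing of audio and image is used as supervision.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def bag_similarity(audio, masks):
    """Score an audio embedding against a bag of mask embeddings.

    Assumption: the bag score is the maximum over instances, a common
    multiple-instance pooling choice; the paper may use a different one.
    """
    return max(cosine(audio, m) for m in masks)


def mil_contrastive_loss(audios, mask_bags, temperature=0.1):
    """InfoNCE-style loss over bag-level scores.

    audios[i] and mask_bags[i] come from the same video, so pair (i, i) is
    the positive and every other bag in the batch serves as a negative.
    No mask-level audio correspondence is required.
    """
    total = 0.0
    for i, audio in enumerate(audios):
        logits = [bag_similarity(audio, bag) / temperature for bag in mask_bags]
        peak = max(logits)  # subtracted for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audios)


def associate(audio, masks):
    """Inference: index of the mask whose embedding best matches the audio."""
    return max(range(len(masks)), key=lambda k: cosine(audio, masks[k]))


# Toy batch with 2-D embeddings: each image holds two candidate object masks.
audios = [[1.0, 0.0], [0.0, 1.0]]
mask_bags = [
    [[1.0, 0.0], [-1.0, 0.0]],  # image 0 contains a mask aligned with audio 0
    [[0.0, 1.0], [0.0, -1.0]],  # image 1 contains a mask aligned with audio 1
]
aligned_loss = mil_contrastive_loss(audios, mask_bags)
swapped_loss = mil_contrastive_loss(audios, mask_bags[::-1])
```

In this toy batch the loss is lower when each audio clip is paired with its own image than when the pairing is swapped. At inference, `associate` picks the mask to pass on to the classification stage.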
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Submission Number: 1512