Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Tianxiang Chen; Zhentao Tan; Qi Chu; Yue Wu; Bin Liu; Nenghai Yu

19 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024

Primary Area: representation learning for computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: audio-visual segmentation, multi-modality interaction

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: How to make audio and vision interact effectively has long attracted interest in the multi-modality community. Recently, a novel audio-visual segmentation (AVS) task has been proposed, which aims to segment the sounding objects in video frames using audio cues. However, current AVS methods suffer from a modality imbalance. Audio features are fused insufficiently because the interaction is unidirectional and deficient, whereas visual information is exploited far more thoroughly. As a result, the output features are dominated by visual representations, which restricts audio-visual representation learning and may cause false alarms. To address this issue, we propose AVSAC, in which a Bidirectional Audio-Visual Decoder (BAVD) is built with multiple bidirectional bridges. This design strengthens audio cues and enables continuous interaction between audio and visual representations, which reduces modality imbalance and boosts audio-visual representation learning. Furthermore, we introduce Audio Feature Reconstruction (AFR), which avoids harmful data bias and curtails audio information loss by reconstructing lost audio features from visual signals. Extensive experiments show that our method achieves new state-of-the-art performance on the AVS benchmark, with especially significant improvements (about 6% in mIoU and 4% in F-score) on the most challenging MS3 subset, which requires segmenting multiple sound sources.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1835
