Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video.
Previous work relies on carefully hand-designed architectures trained with large numbers of pixel-wise accurate masks as supervision. However, such pixel-level annotations are expensive to obtain and are not available in all settings.
In this work, we investigate a novel weakly-supervised multi-modal problem, weakly-supervised audio-visual segmentation, in which the supervision is relaxed to instance-level annotations. We present a framework for Weakly-Supervised Audio-Visual Segmentation, namely WS-AVS, that predicts high-quality pixel-wise masks for sounding objects without pixel-level annotations.
We conduct extensive experiments on the AVSBench dataset. The results demonstrate that the proposed WS-AVS achieves state-of-the-art performance on weakly-supervised audio-visual segmentation.