Weakly-Supervised Audio-Visual Segmentation

Anonymous NeurIPS 2023 Submission,
Paper ID 1363

WS-AVS predicts sound source masks given both audio and image as input without pixel-level annotations.

Abstract

Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video.

Previous work relied on carefully hand-designed architectures trained with large numbers of pixel-wise accurate masks as supervision. However, such pixel-level masks are expensive to annotate and not available in all cases.

In this work, we investigate a novel weakly-supervised multi-modal problem that simplifies the supervision to instance-level annotations, i.e., weakly-supervised audio-visual segmentation. We present a novel framework for Weakly-Supervised Audio-Visual Segmentation, namely WS-AVS, that can predict high-quality pixel-wise masks for sounding objects without pixel-level annotations.

We conduct extensive experiments on the AVSBench dataset. The results demonstrate that the proposed WS-AVS achieves state-of-the-art weakly-supervised audio-visual segmentation performance.
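The abstract does not describe the model internals, but the task interface it states (audio and image in, pixel-level sound-source mask out) can be sketched minimally. The snippet below is purely illustrative and is not the authors' method: the function name, the cosine-similarity scoring of pixels against an audio embedding, and the thresholding step are all assumptions made for exposition.

```python
import numpy as np

def predict_sound_source_mask(visual_feats, audio_emb, threshold=0.5):
    """Toy audio-visual segmentation sketch (illustrative, not WS-AVS):
    score each pixel by cosine similarity between its visual feature and
    the clip-level audio embedding, then threshold into a binary mask.

    visual_feats: (H, W, D) per-pixel visual features
    audio_emb:    (D,) audio clip embedding
    returns:      (H, W) binary mask of predicted sounding regions
    """
    # L2-normalize visual features per pixel and the audio embedding
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    sim = v @ a  # (H, W) cosine similarity map
    # Rescale similarities to [0, 1] before thresholding
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return (sim >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # stand-in for a visual feature map
audio = rng.normal(size=(8,))        # stand-in for an audio embedding
mask = predict_sound_source_mask(feats, audio)
```

Under weak supervision, only an instance-level label ("this clip contains a sounding object of class X") would be available to train such a model, rather than the ground-truth mask itself.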

Qualitative Results

Qualitative comparison on raw video frames: masks predicted by AVS (ws), CAM, CCAM, EZ-VSL, and WS-AVS (ours).