Abstract: Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues. However, most approaches operate under a closed-set assumption and only identify the pre-defined categories seen in the training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: \textbf{open-vocabulary audio-visual semantic segmentation}, which extends the AVSS task to open-world scenarios beyond the annotated label space. This task is more challenging because it requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets from the AVSBench-semantic benchmark, forming AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model across all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%.
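To make the two-part design described in the abstract concrete, the sketch below outlines one plausible way such a framework could be organized: a sound source localization module that fuses audio and visual features into class-agnostic mask proposals, followed by an open-vocabulary classifier that matches mask embeddings against text embeddings of category names from a pre-trained vision-language model. All module names, dimensions, and the fusion layout here are illustrative assumptions and are not taken from the paper's actual implementation.

```python
# Minimal structural sketch (hypothetical names and shapes; the paper's real
# backbones, fusion design, and heads are not specified here).
import torch
import torch.nn as nn


class UniversalSoundSourceLocalizer(nn.Module):
    """Fuses audio and visual features and proposes class-agnostic masks."""

    def __init__(self, dim: int = 256, num_queries: int = 100):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)          # project audio embeddings to the shared dim
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.queries = nn.Embedding(num_queries, dim)  # mask queries for potential sounding objects
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, visual_feats, audio_feats, pixel_embeds):
        # visual_feats: (B, N, dim) flattened visual tokens; audio_feats: (B, T, 128)
        audio = self.audio_proj(audio_feats)
        fused, _ = self.fusion(visual_feats, audio, audio)      # audio-conditioned visual features
        q = self.queries.weight.unsqueeze(0).expand(fused.size(0), -1, -1)
        obj_embeds, _ = self.fusion(q, fused, fused)            # query decoding (heavily simplified)
        # pixel_embeds: (B, dim, H, W) -> per-query mask logits (B, Q, H, W)
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_head(obj_embeds), pixel_embeds)
        return obj_embeds, masks


class OpenVocabClassifier(nn.Module):
    """Scores each mask embedding against text embeddings of category names."""

    def __init__(self, dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.to_text_space = nn.Linear(dim, text_dim)

    def forward(self, obj_embeds, text_embeds):
        # text_embeds: (K, text_dim) from a frozen pre-trained text encoder (base + novel classes)
        v = nn.functional.normalize(self.to_text_space(obj_embeds), dim=-1)
        t = nn.functional.normalize(text_embeds, dim=-1)
        return v @ t.t()                                        # (B, Q, K) class logits
```

In this sketch, segmentation quality comes from the audio-conditioned mask queries, while open-vocabulary recognition comes from comparing query embeddings with frozen text embeddings, so novel categories can be added at inference time simply by encoding their names.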
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this work, we propose a novel multi-modal task, open-vocabulary audio-visual semantic segmentation (AVSS), which aims to segment and classify sound-emitting objects in videos from open-set categories. To accomplish this task, we develop the first open-vocabulary AVSS framework, which predicts unseen categories with the help of the abundant knowledge in large-scale pre-trained vision-language models. This benefits various applications such as automated video indexing, content-based video retrieval, and enhanced accessibility services. Moreover, the strong zero-shot generalization ability demonstrated by the proposed model in our experiments indicates its robustness and adaptability, which are essential qualities for practical multimedia applications. This capability ensures that the system can effectively handle diverse and evolving content, making it more versatile and useful in real-world scenarios. We hope this work can further promote the study of audio-visual segmentation generalization in zero-shot, open-vocabulary, and real-world scenarios.
Supplementary Material: zip
Submission Number: 4882