Keywords: Open-Vocabulary Audio-Visual Segmentation, Multimedia Foundation Models, Large Language Models
Abstract: Audio-visual segmentation (AVS) aims to segment sounding objects in videos by predicting pixel-level masks conditioned on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment, which limits their ability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free, language-based approach that, for the first time, effectively aligns the audio and visual modalities via a text proxy for open-vocabulary AVS. Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text description generation, 2) visual-to-text description generation, 3) LLM-guided prompt translation, and 4) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that harnesses the strengths of appropriate foundation models, thereby maximizing their potential for effective knowledge transfer to downstream AVS tasks. Moreover, we present OpenAVS-ST, a model-agnostic framework that integrates OpenAVS with any advanced supervised AVS model via pseudo-label-based self-training. This approach further enhances performance by effectively utilizing large-scale unlabeled data when available.
Comprehensive experiments on four benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute gains of 3.9%–6.7% in mIoU and 2.2%–4.9% in F-score in challenging scenarios.
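The following is a minimal sketch of the four-step, training-free inference pipeline described in the abstract. The function names, type aliases, and signatures are illustrative assumptions, not the authors' code: the four foundation models (audio captioner, image captioner, LLM prompt translator, text-promptable segmenter) are passed in as plain callables so the sketch stays model-agnostic.

```python
# Illustrative sketch of the OpenAVS inference pipeline, assuming the four
# foundation models are supplied as callables. Names and signatures are
# hypothetical, not the authors' implementation.

from typing import Callable, List, Sequence
import numpy as np

AudioCaptioner = Callable[[np.ndarray], str]            # waveform -> description
ImageCaptioner = Callable[[np.ndarray], str]            # RGB frame -> description
PromptTranslator = Callable[[str, str], List[str]]      # (audio, visual) -> object prompts
TextSegmenter = Callable[[np.ndarray, Sequence[str]], np.ndarray]  # frame, prompts -> mask


def openavs_infer(
    frame: np.ndarray,
    audio: np.ndarray,
    caption_audio: AudioCaptioner,
    caption_image: ImageCaptioner,
    translate_prompts: PromptTranslator,
    segment_by_text: TextSegmenter,
) -> np.ndarray:
    """Training-free AVS: align audio and vision through a text proxy.

    1) describe the audio, 2) describe the frame, 3) ask an LLM which
    described objects are actually sounding, 4) segment those objects
    with a text-promptable segmenter.
    """
    audio_desc = caption_audio(audio)                      # step 1: audio -> text
    visual_desc = caption_image(frame)                     # step 2: visual -> text
    prompts = translate_prompts(audio_desc, visual_desc)   # step 3: LLM prompt translation
    return segment_by_text(frame, prompts)                 # step 4: text -> mask


# Toy usage with dummy stand-ins; a real system would plug in an audio
# captioner, an image captioner, an LLM, and an open-vocabulary segmenter.
if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    audio = np.zeros(16000, dtype=np.float32)
    mask = openavs_infer(
        frame,
        audio,
        caption_audio=lambda a: "a dog is barking",
        caption_image=lambda f: "a dog and a parked car on a street",
        translate_prompts=lambda a_desc, v_desc: ["dog"],
        segment_by_text=lambda f, p: np.zeros(f.shape[:2], dtype=bool),
    )
    print(mask.shape)  # (240, 320)
```

In the self-training variant (OpenAVS-ST), masks produced this way on unlabeled videos would serve as pseudo-labels for fine-tuning a supervised AVS model; the specific model and training recipe are whatever the practitioner plugs in.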
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6561