Keywords: Open-Vocabulary Audio-Visual Segmentation, Multimedia Foundation Models, Large Language Models
Abstract: Audio-visual segmentation (AVS) aims to segment sounding objects in videos by predicting pixel-level masks conditioned on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment, which limits their ability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free, language-based approach that, for the first time, effectively aligns the audio and visual modalities via a text proxy for open-vocabulary AVS. Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text description generation, 2) visual-to-text description generation, 3) LLM-guided prompt translation, and 4) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that harnesses the strengths of appropriate foundation models, thereby maximizing their potential for effective knowledge transfer to downstream AVS tasks. Moreover, we present OpenAVS-ST, a model-agnostic framework that integrates OpenAVS with any advanced supervised AVS model via pseudo-label-based self-training. This approach further enhances performance by effectively utilizing large-scale unlabeled data when available.
Comprehensive experiments on four benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute gains of 3.9%–6.7% in mIoU and 2.2%–4.9% in F-score in challenging scenarios.
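The following is a minimal sketch of the four-step, training-free inference pipeline described in the abstract. The function names, type aliases, and signatures are illustrative assumptions, not the authors' code: the four foundation models (audio captioner, image captioner, LLM prompt translator, text-promptable segmenter) are passed in as plain callables so the sketch stays model-agnostic.

```python
# Illustrative sketch of the OpenAVS inference pipeline, assuming the four
# foundation models are supplied as callables. Names and signatures are
# hypothetical, not the authors' implementation.

from typing import Callable, List, Sequence
import numpy as np

AudioCaptioner = Callable[[np.ndarray], str]            # waveform -> description
ImageCaptioner = Callable[[np.ndarray], str]            # RGB frame -> description
PromptTranslator = Callable[[str, str], List[str]]      # (audio, visual) -> object prompts
TextSegmenter = Callable[[np.ndarray, Sequence[str]], np.ndarray]  # frame, prompts -> mask


def openavs_infer(
    frame: np.ndarray,
    audio: np.ndarray,
    caption_audio: AudioCaptioner,
    caption_image: ImageCaptioner,
    translate_prompts: PromptTranslator,
    segment_by_text: TextSegmenter,
) -> np.ndarray:
    """Training-free AVS: align audio and vision through a text proxy.

    1) describe the audio, 2) describe the frame, 3) ask an LLM which
    described objects are actually sounding, 4) segment those objects
    with a text-promptable segmenter.
    """
    audio_desc = caption_audio(audio)                      # step 1: audio -> text
    visual_desc = caption_image(frame)                     # step 2: visual -> text
    prompts = translate_prompts(audio_desc, visual_desc)   # step 3: LLM prompt translation
    return segment_by_text(frame, prompts)                 # step 4: text -> mask


# Toy usage with dummy stand-ins; a real system would plug in an audio
# captioner, an image captioner, an LLM, and an open-vocabulary segmenter.
if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    audio = np.zeros(16000, dtype=np.float32)
    mask = openavs_infer(
        frame,
        audio,
        caption_audio=lambda a: "a dog is barking",
        caption_image=lambda f: "a dog and a parked car on a street",
        translate_prompts=lambda a_desc, v_desc: ["dog"],
        segment_by_text=lambda f, p: np.zeros(f.shape[:2], dtype=bool),
    )
    print(mask.shape)  # (240, 320)
```

In the self-training variant (OpenAVS-ST), masks produced this way on unlabeled videos would serve as pseudo-labels for fine-tuning a supervised AVS model; the specific model and training recipe are whatever the practitioner plugs in.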
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6561