Cocktail-Party at the MUSEUM: Referring Audio-Visual Segmentation requires Augmentation

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Referring Audio-Visual Segmentation, Data Augmentation, MLLM
Abstract: Referring Audio-Visual Segmentation (Ref-AVS) has advanced significantly with the development of multimodal fusion methods and Multimodal Large Language Models (MLLMs). However, modality-specific performance remains underexplored, and the effectiveness of audio perception is unclear. We find that current methods often fail to identify the correct sounding object from audio expressions (e.g., $\textit{the loudest sounding object}$), especially in cocktail-party scenarios (i.e., with mixed audio sources). In addition, MLLM-based methods tend to rely on memorized visual-text patterns due to their weaker audio understanding. To this end, we first propose $\textbf{MISA}$: a $\textbf{M}$usical-audio $\textbf{I}$nstructed $\textbf{S}$egmentation $\textbf{A}$ssistant, which integrates the specialized musical-audio encoder MERT and a music-specific alignment dataset to enhance the representation of audio tokens. To mitigate the limited variation of mixed-source signals, we introduce $\textbf{MUSEUM}$, a musical-audio augmentation pipeline with three stages: $\textbf{MU}$sical $\textbf{S}$ourc$\textbf{E}$, A$\textbf{U}$gment, and $\textbf{M}$ix, which respectively perform source separation, sampling from extra musical datasets, and audio augmentation. The proposed augmentation enriches the mixtures of audio signals in the existing training dataset, enabling the model to learn from more diverse samples. Moreover, we refine the existing benchmark into $\textbf{C-Ref-AVSBench}$, which categorizes expressions as Audio-Centric (audio cues), AV-Grounded (audio and visual cues), or Visual-Centric (visual cues) to enable modality-specific evaluation. Our approach achieves state-of-the-art performance on both Ref-AVSBench and C-Ref-AVSBench, particularly on Audio-Centric expressions.
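To make the three-stage MUSEUM pipeline concrete, below is a minimal Python sketch of the separate–sample–augment–mix flow described in the abstract. It is not the authors' implementation: the `separate_sources` stub stands in for whatever music source separator the paper uses, and the gain/polarity augmentations and random stem dropout are illustrative assumptions.

```python
import numpy as np


def separate_sources(mixture: np.ndarray, sample_rate: int) -> list[np.ndarray]:
    # Placeholder: swap in a real music source separator here.
    # Returning the mixture as a single "stem" keeps the sketch runnable.
    return [mixture]


def museum_augment(
    mixture: np.ndarray,
    sample_rate: int,
    extra_sources: list[np.ndarray],
    rng: np.random.Generator | None = None,
) -> np.ndarray:
    """Sketch of a MUSEUM-style augmentation: source separation,
    sampling an extra musical source, then augment-and-mix."""
    rng = rng or np.random.default_rng()

    # Stage 1 (MUsical SourcE): split the mixture into stems.
    stems = separate_sources(mixture, sample_rate)

    # Stage 2 (AUgment): draw an extra source from an external musical
    # dataset and apply simple waveform-level augmentations
    # (random gain and polarity flip -- assumed, not from the paper).
    extra = extra_sources[rng.integers(len(extra_sources))][: len(mixture)]
    extra = rng.uniform(0.5, 1.0) * extra * rng.choice([-1.0, 1.0])

    # Stage 3 (Mix): recombine a random subset of stems with the extra
    # source to produce a new cocktail-party mixture.
    kept = [s for s in stems if rng.random() > 0.3] or stems
    mixed = np.sum(kept, axis=0) + extra

    # Peak-normalize so the augmented clip does not clip.
    return mixed / max(np.max(np.abs(mixed)), 1e-8)
```

Mixing a freshly sampled, augmented source into a randomly thinned set of stems is one straightforward way to diversify the mixed-source conditions a Ref-AVS model sees during training.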
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2489