Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in videos according to associated audio cues, where both modalities are affected by noise to different extents, such as the blending of background noise into the audio or the presence of distracting objects in the video. Most existing methods focus on learning interactions between modalities at high semantic levels but are incapable of filtering low-level noise or achieving fine-grained representational interactions during the early feature extraction phase. Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. SelM employs a State Space model for noise reduction and robust feature selection. By imposing additional bidirectional constraints on audio and visual embeddings, it is able to precisely identify crucial features corresponding to sound-emitting targets. To fill the existing gap in early fusion within AVS, SelM introduces a dual alignment mechanism specifically engineered to facilitate intricate spatio-temporal interactions between audio and visual streams, yielding more fine-grained representations. Moreover, we develop a cross-level decoder for layered reasoning, significantly enhancing segmentation precision by exploring the complex relationships between audio and visual information. SelM achieves state-of-the-art performance on AVS tasks, especially in the challenging Audio-Visual Semantic Segmentation setting. The source code will be made publicly available.
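Illustrative sketch (not part of the submission or its code release): the bidirectional constraints and dual alignment described in the abstract can be pictured, under loose assumptions, as a block in which audio tokens attend to visual tokens and vice versa, with sigmoid gates standing in for the State Space based selection that SelM actually uses. All names here (BidirectionalAlign, d_model, the gate layers) are hypothetical placeholders, not the authors' API.

import torch
import torch.nn as nn

class BidirectionalAlign(nn.Module):
    """Minimal stand-in for a dual (audio->visual and visual->audio) alignment block.

    NOTE: this is a conceptual sketch. SelM's paper describes State Space layers for
    selective feature filtering; here plain cross-attention plus sigmoid gating is
    used only to illustrate the bidirectional, early-fusion idea.
    """

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        # one attention module per direction: audio queries vision, vision queries audio
        self.a2v = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)
        # sigmoid gates emulate "selection": features with weak cross-modal support are suppressed
        self.gate_a = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, T_a, d_model) audio-frame tokens; visual: (B, N_v, d_model) flattened spatio-temporal tokens
        a_ctx, _ = self.v2a(audio, visual, visual)   # audio queries visual context
        v_ctx, _ = self.a2v(visual, audio, audio)    # visual queries audio context
        audio = self.norm_a(audio + self.gate_a(a_ctx) * a_ctx)
        visual = self.norm_v(visual + self.gate_v(v_ctx) * v_ctx)
        return audio, visual

if __name__ == "__main__":
    block = BidirectionalAlign()
    a = torch.randn(2, 5, 256)        # e.g. 5 audio frames per clip
    v = torch.randn(2, 14 * 14, 256)  # e.g. one 14x14 visual feature map per clip
    a_out, v_out = block(a, v)
    print(a_out.shape, v_out.shape)   # torch.Size([2, 5, 256]) torch.Size([2, 196, 256])

In an early-fusion setting such a block would sit between backbone stages, so both streams are refined before any high-level semantic interaction; this placement, rather than the specific attention mechanism, is the point the sketch is meant to convey.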
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes significantly to multimedia and multimodal processing by addressing the critical challenge of segmenting sound-producing objects in noisy audio-visual environments. The innovative architecture, SelM, advances the field by adeptly reducing low-level noise and isolating crucial features for sound-emitting targets, a task that existing methods struggle with due to their focus on high-level semantic interactions. SelM’s bidirectional constraints and dual alignment mechanism enable a nuanced fusion of audio and visual data at early stages, a novel approach that leads to fine-grained spatio-temporal representations. This capability is particularly valuable for multimedia applications that require precise synchronization of sound and image, such as in automated content creation, augmented reality, and advanced surveillance systems. The introduction of a cross-level decoder further enhances the model's interpretative strength, allowing for a more complex understanding of the interplay between auditory and visual cues. By achieving state-of-the-art performance, particularly in complex segmentation scenarios, SelM sets a new benchmark for AVS tasks, promoting more accurate and reliable multimedia analysis. The public availability of the source code will facilitate further research and development, fostering innovation and practical implementations in the field of multimodal processing.
Supplementary Material: zip
Submission Number: 936