Abstract: Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with points, boxes, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements pose challenges to its practical applicability in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize the EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we designed the Inter-Frame Coherence module, which independently integrates the temporal information from both visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge, and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.
External IDs:dblp:journals/tmm/ZhuLY25
Loading