Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking

Published: 20 Jul 2024, Last Modified: 21 Jul 2024
Venue: MM2024 Poster
License: CC BY 4.0
Abstract: Panoramic audio-visual saliency detection aims to segment the regions that attract the most attention in 360° panoramic videos with sound. To delineate the detected salient regions precisely and to model shifts in human attention effectively, we extend this task to a finer-grained, instance-level setting: identifying salient object instances and inferring their saliency ranks. In this paper, we propose the first instance-level framework that simultaneously segments and ranks multiple salient objects in panoramic videos. Specifically, it consists of a distortion-aware pixel decoder to overcome panoramic distortions, a sequential audio-visual fusion module to integrate audio-visual information, and a spatio-temporal object decoder to separate individual instances and predict their saliency scores. Moreover, because no such annotations exist, we create ground-truth saliency ranks for the PAVS10K benchmark. Extensive experiments demonstrate that our model achieves state-of-the-art performance on PAVS10K for both saliency detection and ranking. The code and dataset will be released soon.
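The abstract says the object decoder predicts a saliency score per instance and that ranks are inferred from those scores. The snippet below is only a minimal sketch of one plausible way to turn scores into ranks, assuming a higher score means a more salient instance. It is not the authors' implementation: the function name `rank_instances`, the instance ids, the score values, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict


def rank_instances(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each instance id to a saliency rank, where 1 is the most salient.

    Assumption (not from the paper): ranks follow descending score order,
    and ties are broken alphabetically by instance id.
    """
    ordered = sorted(scores, key=lambda inst: (-scores[inst], inst))
    return {inst: rank for rank, inst in enumerate(ordered, start=1)}


# Hypothetical per-instance saliency scores for one frame.
example_scores = {"speaker": 0.91, "guitar": 0.64, "audience": 0.22}
example_ranks = rank_instances(example_scores)
print(example_ranks)
```

Sorting on the negated score keeps the ordering deterministic when two instances share the same score, which matters if predicted ranks are later compared against ground-truth ranks.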
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our paper contributes to multimedia/multimodal processing as follows:
1. We introduce a novel task of instance-level saliency detection and ranking in 360° panoramic videos with sound. This extends the traditional saliency detection task to a more comprehensive understanding of salient objects in immersive multimedia content.
2. We propose a sequential audio-visual fusion module that activates sounding regions and then successively performs instance-aware cross-modal fusion. This new fusion strategy can provide insights for the field of multimodal fusion.
3. The distortion-aware pixel decoder addresses the challenge posed by panoramic distortions, supporting accurate saliency detection in 360° videos. This contributes to the development of processing techniques tailored to immersive multimedia formats.
4. The ground-truth saliency ranks we create for the PAVS10K benchmark provide a valuable resource for future research in this area, promoting the development and evaluation of advanced multimodal saliency detection methods.
Supplementary Material: zip
Submission Number: 2387