CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Abstract: Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting the semantic correlation between the visual and audio modalities, but they ignore the negative effects caused by the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which comprehensively accounts for both audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder that elegantly associates video frames with the corresponding sound source, a novel consistency-aware predictive coding module is designed to iteratively improve the consistency between the audio and visual representations. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced to generate the final saliency map. Extensive experiments demonstrate that the proposed CASP-Net outperforms other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system, please see our project webpage.
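The abstract describes a three-stage pipeline: a two-stream audio-visual encoder, an iterative consistency-aware predictive coding stage, and a saliency decoder. The following is a minimal sketch of that pipeline, assuming pooled per-modality feature vectors, simple stand-in encoders, and a predictive-coding-style error-correction update; module names, tensor shapes, and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a CASP-Net-style pipeline (not the official code).
import torch
import torch.nn as nn


class ConsistencyAwarePredictiveCoding(nn.Module):
    """Iteratively refines audio/visual features toward mutual consistency."""

    def __init__(self, dim: int, num_iters: int = 3):
        super().__init__()
        self.num_iters = num_iters
        self.v_update = nn.Linear(dim, dim)  # predicts a visual correction
        self.a_update = nn.Linear(dim, dim)  # predicts an audio correction

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v, a: (batch, dim) pooled visual / audio representations.
        for _ in range(self.num_iters):
            # The cross-modal prediction error drives each update, loosely
            # mirroring predictive-coding-style error correction (assumed).
            err = v - a
            v = v - torch.tanh(self.v_update(err))
            a = a + torch.tanh(self.a_update(err))
        return v, a


class CASPNetSketch(nn.Module):
    """Two-stream encoder -> consistency refinement -> saliency decoder."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for the visual (video clip) and audio (spectrogram) streams.
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(3, dim, 3, padding=1), nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cpc = ConsistencyAwarePredictiveCoding(dim)
        # Toy decoder mapping the fused features to a coarse saliency map.
        self.decoder = nn.Sequential(nn.Linear(2 * dim, 56 * 56), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor):
        v = self.visual_encoder(frames)       # (B, dim)
        a = self.audio_encoder(spectrogram)   # (B, dim)
        v, a = self.cpc(v, a)                 # consistency refinement
        fused = torch.cat([v, a], dim=-1)
        return self.decoder(fused).view(-1, 1, 56, 56)  # saliency map


if __name__ == "__main__":
    model = CASPNetSketch()
    frames = torch.randn(2, 3, 8, 224, 224)   # (B, C, T, H, W) video clip
    spec = torch.randn(2, 1, 64, 96)          # (B, 1, mel bins, time) audio
    print(model(frames, spec).shape)          # torch.Size([2, 1, 56, 56])
```

The sketch only conveys the data flow: each modality is encoded separately, the two representations are pulled toward agreement over a few refinement iterations before fusion, and the decoder produces the saliency map; the paper's actual multi-scale aggregation and encoder backbones are omitted.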