CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Abstract: Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting the semantic correlation between the visual and audio modalities, but they ignore the negative effects caused by the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which comprehensively accounts for both audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder that elegantly associates video frames with the corresponding sound source, a novel consistency-aware predictive coding module is designed to iteratively improve the consistency between the audio and visual representations. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced to generate the final saliency map. Extensive experiments demonstrate that the proposed CASP-Net outperforms other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system, please see our project webpage.
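The abstract describes a three-stage pipeline: a two-stream audio-visual encoder, an iterative consistency-aware predictive coding stage, and a saliency decoder. The following is a minimal sketch of that pipeline, assuming pooled per-modality feature vectors, simple stand-in encoders, and a predictive-coding-style error-correction update; module names, tensor shapes, and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a CASP-Net-style pipeline (not the official code).
import torch
import torch.nn as nn


class ConsistencyAwarePredictiveCoding(nn.Module):
    """Iteratively refines audio/visual features toward mutual consistency."""

    def __init__(self, dim: int, num_iters: int = 3):
        super().__init__()
        self.num_iters = num_iters
        self.v_update = nn.Linear(dim, dim)  # predicts a visual correction
        self.a_update = nn.Linear(dim, dim)  # predicts an audio correction

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v, a: (batch, dim) pooled visual / audio representations.
        for _ in range(self.num_iters):
            # The cross-modal prediction error drives each update, loosely
            # mirroring predictive-coding-style error correction (assumed).
            err = v - a
            v = v - torch.tanh(self.v_update(err))
            a = a + torch.tanh(self.a_update(err))
        return v, a


class CASPNetSketch(nn.Module):
    """Two-stream encoder -> consistency refinement -> saliency decoder."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for the visual (video clip) and audio (spectrogram) streams.
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(3, dim, 3, padding=1), nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cpc = ConsistencyAwarePredictiveCoding(dim)
        # Toy decoder mapping the fused features to a coarse saliency map.
        self.decoder = nn.Sequential(nn.Linear(2 * dim, 56 * 56), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor):
        v = self.visual_encoder(frames)       # (B, dim)
        a = self.audio_encoder(spectrogram)   # (B, dim)
        v, a = self.cpc(v, a)                 # consistency refinement
        fused = torch.cat([v, a], dim=-1)
        return self.decoder(fused).view(-1, 1, 56, 56)  # saliency map


if __name__ == "__main__":
    model = CASPNetSketch()
    frames = torch.randn(2, 3, 8, 224, 224)   # (B, C, T, H, W) video clip
    spec = torch.randn(2, 1, 64, 96)          # (B, 1, mel bins, time) audio
    print(model(frames, spec).shape)          # torch.Size([2, 1, 56, 56])
```

The sketch only conveys the data flow: each modality is encoded separately, the two representations are pulled toward agreement over a few refinement iterations before fusion, and the decoder produces the saliency map; the paper's actual multi-scale aggregation and encoder backbones are omitted.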