Abstract: Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an extra network Saliency-UNet is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics. The project url is https://junwenxiong.github.io/DiffSal.
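For concreteness, the following is a minimal sketch of the conditional generative formulation described above: a ground-truth saliency map is corrupted with noise, and a denoiser conditioned on fused audio-visual features is trained to recover it. The names (SaliencyDenoiser, training_step), the FiLM-style modulation, the linear noise schedule, and the x0-prediction loss are illustrative assumptions, not the paper's actual Saliency-UNet design.

```python
# Minimal DDPM-style sketch of conditional saliency-map generation (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class SaliencyDenoiser(nn.Module):
    """Toy stand-in for the Saliency-UNet: predicts a clean saliency map
    from a noisy map, modulated by fused audio-visual features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.enc = nn.Conv2d(1, feat_dim, 3, padding=1)
        self.film = nn.Linear(feat_dim, 2 * feat_dim)    # feature-wise modulation (assumed)
        self.dec = nn.Conv2d(feat_dim, 1, 3, padding=1)

    def forward(self, noisy_map, av_feat):
        h = F.relu(self.enc(noisy_map))                  # (B, C, H, W)
        scale, shift = self.film(av_feat).chunk(2, dim=-1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.dec(h)                               # predicted clean saliency map

def training_step(model, clean_map, av_feat):
    """One diffusion training step: noise the ground-truth saliency map,
    then ask the conditioned model to recover it."""
    b = clean_map.size(0)
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(clean_map)
    noisy_map = a_bar.sqrt() * clean_map + (1 - a_bar).sqrt() * noise
    pred = model(noisy_map, av_feat)
    return F.mse_loss(pred, clean_map)                   # x0-prediction objective (assumed)

if __name__ == "__main__":
    model = SaliencyDenoiser()
    clean = torch.rand(2, 1, 64, 64)         # ground-truth saliency maps
    av = torch.randn(2, 256)                  # fused spatio-temporal audio-visual features
    print(training_step(model, clean, av).item())
```

At inference, the same conditioned denoiser would be applied iteratively, starting from a pure-noise map and progressively refining it into the predicted saliency map.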