Abstract: Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an extra network Saliency-UNet is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics. The project url is https://junwenxiong.github.io/DiffSal.
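For concreteness, the following is a minimal sketch of the conditional generative formulation described above: a ground-truth saliency map is corrupted with noise, and a denoiser conditioned on fused audio-visual features is trained to recover it. The names (SaliencyDenoiser, training_step), the FiLM-style modulation, the linear noise schedule, and the x0-prediction loss are illustrative assumptions, not the paper's actual Saliency-UNet design.

```python
# Minimal DDPM-style sketch of conditional saliency-map generation (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class SaliencyDenoiser(nn.Module):
    """Toy stand-in for the Saliency-UNet: predicts a clean saliency map
    from a noisy map, modulated by fused audio-visual features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.enc = nn.Conv2d(1, feat_dim, 3, padding=1)
        self.film = nn.Linear(feat_dim, 2 * feat_dim)    # feature-wise modulation (assumed)
        self.dec = nn.Conv2d(feat_dim, 1, 3, padding=1)

    def forward(self, noisy_map, av_feat):
        h = F.relu(self.enc(noisy_map))                  # (B, C, H, W)
        scale, shift = self.film(av_feat).chunk(2, dim=-1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.dec(h)                               # predicted clean saliency map

def training_step(model, clean_map, av_feat):
    """One diffusion training step: noise the ground-truth saliency map,
    then ask the conditioned model to recover it."""
    b = clean_map.size(0)
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(clean_map)
    noisy_map = a_bar.sqrt() * clean_map + (1 - a_bar).sqrt() * noise
    pred = model(noisy_map, av_feat)
    return F.mse_loss(pred, clean_map)                   # x0-prediction objective (assumed)

if __name__ == "__main__":
    model = SaliencyDenoiser()
    clean = torch.rand(2, 1, 64, 64)         # ground-truth saliency maps
    av = torch.randn(2, 256)                  # fused spatio-temporal audio-visual features
    print(training_step(model, clean, av).item())
```

At inference, the same conditioned denoiser would be applied iteratively, starting from a pure-noise map and progressively refining it into the predicted saliency map.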