Saliency Prediction of Sports Videos: A Large-Scale Database and a Self-Adaptive Approach

Published: 01 Jan 2024, Last Modified: 02 Aug 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Predicting video saliency is crucial for improving the efficiency of sports video processing, thereby providing an enriched viewing experience for a wide audience. However, there has been a long-standing absence of a well-established eye-tracking database and learning-based approaches tailored specifically to sports videos. In this paper, we establish a large-scale eye-tracking database dubbed audio-visual sports (AVS). AVS consists of 1,000 high-quality sports videos with eye fixations from 60 participants. Through data analysis on AVS, we observe that human attention patterns vary significantly with the specific scene context of the sport. Motivated by this, we propose a sport-aware audio-visual saliency model that adaptively learns the scene context via a hypernetwork. Specifically, a new audio-visual fusion (AVF) block is developed to effectively fuse features from the visual and audio backbones. A hypernetwork is then introduced to learn sport-aware priors, which guide the self-adaptive saliency predictor in producing the saliency map. Experimental results demonstrate that our approach outperforms other state-of-the-art saliency prediction models on the only two existing sports video eye-tracking databases.
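
To make the described pipeline concrete, below is a minimal PyTorch sketch of the fusion-then-hypernetwork flow the abstract outlines. All module names, dimensions, and design choices here (cross-modal attention inside the AVF block, a dynamic 1x1 convolution head generated by the hypernetwork) are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch of the AVF block + hypernetwork-guided predictor.
# Every architectural detail below is an assumption for illustration.
import torch
import torch.nn as nn

class AVFBlock(nn.Module):
    """Audio-visual fusion (AVF): cross-modal attention is one plausible
    way to fuse visual and audio backbone features (assumed here)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feat, audio_feat):
        # Visual tokens attend to audio tokens; residual + norm.
        fused, _ = self.attn(visual_feat, audio_feat, audio_feat)
        return self.norm(visual_feat + fused)

class HyperNetwork(nn.Module):
    """Maps a pooled scene-context embedding to per-video weights for the
    predictor's final 1x1 projection (a 'sport-aware prior')."""
    def __init__(self, ctx_dim=256, pred_channels=64):
        super().__init__()
        self.pred_channels = pred_channels
        # Generate weight (pred_channels) and bias (1) for the 1x1 head.
        self.fc = nn.Linear(ctx_dim, pred_channels + 1)

    def forward(self, context):
        params = self.fc(context)                 # (B, C+1)
        weight = params[:, :self.pred_channels]   # (B, C)
        bias = params[:, self.pred_channels:]     # (B, 1)
        return weight, bias

class SelfAdaptivePredictor(nn.Module):
    """Decodes fused features into a saliency map; the final 1x1
    projection is parameterized per video by the hypernetwork."""
    def __init__(self, dim=256, mid=64):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(dim, mid, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, fused_map, weight, bias):
        x = self.decode(fused_map)                # (B, mid, H, W)
        # Per-sample dynamic 1x1 conv via einsum over channels.
        sal = torch.einsum('bchw,bc->bhw', x, weight) + bias[:, :, None]
        return torch.sigmoid(sal).unsqueeze(1)    # (B, 1, H, W)

if __name__ == "__main__":
    B, HW, dim, H, W = 2, 49, 256, 7, 7
    visual = torch.randn(B, HW, dim)   # visual backbone tokens
    audio = torch.randn(B, 10, dim)    # audio backbone tokens
    fused = AVFBlock(dim)(visual, audio)
    fused_map = fused.transpose(1, 2).reshape(B, dim, H, W)
    context = fused.mean(dim=1)        # pooled scene-context embedding
    w, b = HyperNetwork(dim, 64)(context)
    sal = SelfAdaptivePredictor(dim, 64)(fused_map, w, b)
    print(sal.shape)                   # torch.Size([2, 1, 7, 7])
```

The hypernetwork design choice is what makes the predictor "self-adaptive": instead of one fixed output head for all sports, the head's parameters are regenerated per input from the scene context, so attention patterns specific to each sport can shape the final saliency map.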