Video salient object detection via self-attention-guided multilayer cross-stack fusion

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · Multimedia Tools and Applications 2024 · CC BY-SA 4.0
Abstract: Video salient object detection identifies objects of interest within video sequences, which requires processing information from both spatial and motion patterns. While many traditional video salient object detection models have aimed to develop efficient spatial and motion features that yield globally consistent salient objects, repetitive spatial information from consecutive frames containing the same objects can reduce a model's generalization ability and stability. Previous attempts to integrate spatial and motion information to enhance the inter-frame correlation of salient objects often rely on simple spatio-temporal fusion, inadvertently introducing redundant information and leading to suboptimal detection performance. The focus therefore needs to shift towards more effective fusion of feature information from different modalities to mitigate the negative effects of this redundancy. In this study, we propose a self-attention-guided, fully supervised multilayer cross-stack fusion approach for video salient object detection that extracts multimodal features to facilitate bidirectional information transfer. Our approach lets spatial and temporal knowledge complement each other, refining the cross-stack of the interacted information and the spatial features to optimize local and global saliency. Consequently, the approach significantly reduces redundant spatial information, mitigating the misidentification of salient objects caused by blurred backgrounds or moving objects, and adaptively assigns greater weight to salient objects to achieve globally consistent saliency. Extensive experiments on five publicly available video salient object detection datasets demonstrate that our approach outperforms multiple state-of-the-art video salient object detection models.
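The abstract does not spell out the fusion mechanism, but the idea of self-attention-guided bidirectional transfer between spatial and temporal streams can be illustrated with a minimal sketch. The names below (`cross_attention`, `bidirectional_fusion`) and the residual-addition refinement are hypothetical illustrations, not the paper's actual architecture; each modality is represented as a short list of feature vectors, and each stream attends over the other via scaled dot-product attention.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query vector attends over keys,
    producing a convex combination of the value vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fusion(spatial, temporal):
    """Hypothetical bidirectional transfer (an assumption, not the paper's
    exact design): spatial features attend to temporal features and vice
    versa, and each stream is refined by a residual addition."""
    s2t = cross_attention(spatial, temporal, temporal)
    t2s = cross_attention(temporal, spatial, spatial)
    fused_spatial = [[a + b for a, b in zip(s, u)]
                     for s, u in zip(spatial, s2t)]
    fused_temporal = [[a + b for a, b in zip(t, u)]
                      for t, u in zip(temporal, t2s)]
    return fused_spatial, fused_temporal
```

In a full model each "vector" would be a per-pixel or per-region deep feature map and the fusion would be applied at multiple decoder layers (the "multilayer cross-stack"); the sketch only shows the attention-guided exchange between the two modalities.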