Exploring a Self-Attentive Multilayer Cross-Stacking Fusion Model for Video Salient Object Detection

Published: 01 Jan 2023 · Last Modified: 11 Apr 2025 · SMC 2023 · License: CC BY-SA 4.0
Abstract: Video salient object detection (VSOD), an effective means of capturing the object of interest in a video sequence, requires processing information from both the spatial and the motion modality. Although many traditional VSOD models have been dedicated to developing efficient spatial and motion features to obtain globally consistent salient objects, the highly redundant spatial information introduced by consecutive identical objects inevitably reduces the generalization ability of these models. While integrating spatial and motion information can improve the inter-frame correlation of salient objects to some extent, previous models tend to rely only on simple spatio-temporal fusion, which itself generates redundant information and degrades detection performance. It is therefore necessary to fuse the feature information of the different modalities effectively so as to eliminate the effect of redundant information. In this work, we propose a self-attentive multilayer cross-stacking fusion based VSOD model that extracts multimodal features for two-way information transfer, fully exploits the complementarity of spatial and temporal knowledge, and refines the cross-stacking of the interacted information with the spatial features for local and global saliency optimization. As a result, redundant spatial information is largely eliminated, reducing the misidentification of salient objects caused by blurred backgrounds or moving objects and adaptively assigning higher weights to the salient object to achieve globally consistent saliency. Comprehensive experiments on four publicly available VSOD datasets demonstrate that the proposed model outperforms multiple state-of-the-art VSOD models.
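The abstract does not specify the fusion module's internals, but the mechanism it names (two-way information transfer between spatial and motion features via self-attention, with the interacted features cross-stacked back onto the spatial stream) can be illustrated with a minimal PyTorch sketch. Everything here is an assumption for illustration, not the authors' implementation: the module name `CrossModalFusion`, the single-layer structure, the feature dimension, and the concatenation-based stacking.

```python
# Illustrative sketch only (not the paper's released code): spatial and motion
# token sequences exchange information via cross-attention in both directions,
# and both interacted streams are stacked with the original spatial features.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # spatial -> motion and motion -> spatial attention (two-way transfer)
        self.spa_to_mot = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mot_to_spa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # fuse the interacted features with the original spatial stream
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, spatial: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # spatial, motion: (B, N, C) tokens from flattened backbone feature maps
        spa_enh, _ = self.mot_to_spa(spatial, motion, motion)   # motion refines spatial
        mot_enh, _ = self.spa_to_mot(motion, spatial, spatial)  # spatial refines motion
        # cross-stack: concatenate both interacted streams with the spatial features
        return self.fuse(torch.cat([spatial, spa_enh, mot_enh], dim=-1))


if __name__ == "__main__":
    spa = torch.randn(2, 196, 256)  # e.g. 14x14 spatial tokens
    mot = torch.randn(2, 196, 256)  # matching optical-flow tokens
    print(CrossModalFusion()(spa, mot).shape)  # torch.Size([2, 196, 256])
```

In such a design, the cross-attention directions let each modality suppress the other's redundant responses before the stacked features are reduced back to the working dimension, which is one plausible way the described fusion could counteract the redundancy the abstract highlights.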