STM-SalNet: A Biologically-Inspired Spatial-Temporal Memory Network for Video Saliency Prediction

Jikai Xu, Dandan Zhu, Kaiwei Zhang, Xiongkuo Min

Published: 01 Jan 2026, Last Modified: 26 Jan 2026. Crossref. License: CC BY-SA 4.0
Abstract: Video saliency prediction has attracted significant attention in recent years owing to its role in a wide range of vision tasks. However, most existing methods rely on static encoder-decoder architectures and fail to incorporate the dynamic memory mechanisms that are fundamental to human visual perception and attention. To address this limitation, we propose STM-SalNet, a novel biologically inspired spatial-temporal memory network for video saliency prediction. First, inspired by the powerful visual processing capabilities of the human visual cortex, we introduce a brain-inspired Vision Transformer module that extracts multi-level hierarchical spatial-temporal features. Second, we propose a memory bank module equipped with an active forgetting mechanism, simulating human memory's ability to selectively retain and update information: by dynamically retrieving relevant features from past frames while discarding redundancy, the module adapts robustly to continuously evolving video content. To further integrate spatial and temporal cues, we design a bidirectional spatial-temporal fusion module that enables effective interaction between deep semantic and shallow spatial features, enriching the overall representation. Finally, a progressive hierarchical decoder generates fine-grained, pixel-wise saliency maps that closely align with the ground truth. Extensive experiments on the DHF1K, Hollywood-2, and UCF-Sports benchmarks demonstrate that STM-SalNet achieves competitive performance against existing state-of-the-art methods.
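The abstract gives no implementation details, so as a rough illustration, below is a minimal PyTorch sketch of what a memory bank with an active forgetting mechanism could look like. Everything here is an assumption for illustration (the class name MemoryBankSketch, the slot count, the usage-based decay rule), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryBankSketch(nn.Module):
    """Hypothetical fixed-size feature memory with "active forgetting".

    Retrieves past-frame context by attending over stored slots, then
    decays per-slot usage and overwrites the least-used slot with a
    summary of the current frame.
    """

    def __init__(self, dim: int, num_slots: int = 32, decay: float = 0.95):
        super().__init__()
        self.decay = decay
        self.register_buffer("bank", torch.zeros(num_slots, dim))  # stored features
        self.register_buffer("usage", torch.zeros(num_slots))      # per-slot usage score

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, N, C) spatial tokens of the current frame.
        b, n, c = feat.shape
        # Retrieve: attend from current tokens to all memory slots.
        attn = F.softmax(feat @ self.bank.t() / c ** 0.5, dim=-1)  # (B, N, S)
        context = attn @ self.bank                                 # (B, N, C)
        with torch.no_grad():
            # Active forgetting: decay all slots, reinforce attended ones,
            # then overwrite the least-used slot with the frame summary.
            self.usage.mul_(self.decay).add_(attn.mean(dim=(0, 1)))
            idx = torch.argmin(self.usage)
            self.bank[idx] = feat.mean(dim=(0, 1))
            self.usage[idx] = 1.0
        return feat + context  # fuse retrieved memory with current features
```

Usage-based decay is only one plausible reading of "active forgetting"; the paper's actual retrieval and update rules may differ.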