Abstract: In video salient object detection (VSOD), geometric variations of foreground objects and backgrounds across multiple scales make it difficult for deep learning models to extract and integrate semantic features from video streams. Current deep learning approaches, such as recurrent neural networks and transformers, struggle to capture both short- and long-term temporal dependencies at a global level because of their fixed kernel structures. These methods are also computationally intensive, which limits their practical application. To address these challenges and balance accuracy against computational efficiency, a novel lightweight Deformable Multi-scale Fusion Network is proposed that jointly extracts attention-based multi-scale features and geometric features to generate saliency maps efficiently. Further, a Geometric Multi-Scale Pixel-level Contrastive Learning (GMPCL) approach is proposed; through the GMPCL loss, it enhances the geometric representation of features and separates the geometric representations of foreground and background object features at the pixel level. Performance is evaluated on six benchmark datasets and compared with twenty-two state-of-the-art (SOTA) models. The main highlight of this work is that it outperforms SOTA models on the most challenging dataset, DAVSOD-Difficult, while using 6.2 million network parameters and 5.6 GFLOPs and running at an inference speed of 90 FPS.
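The idea of separating foreground and background features at the pixel level can be illustrated with a generic supervised contrastive (InfoNCE-style) loss over pixel embeddings. The sketch below is not the GMPCL loss itself, whose exact form is not given in the abstract; it is single-scale, ignores the geometric component, and the function name `pixel_contrastive_loss`, the `temperature` parameter, and the input layout are all assumptions made for illustration.

```python
import numpy as np


def pixel_contrastive_loss(features, mask, temperature=0.1):
    """Generic supervised contrastive loss over pixel embeddings.

    Hypothetical illustration only, not the paper's GMPCL loss.
    features: (N, D) array, one embedding per pixel.
    mask:     (N,) array of 0/1 labels (0 = background, 1 = foreground).
    """
    features = np.asarray(features, dtype=np.float64)
    mask = np.asarray(mask).astype(bool)

    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.maximum(norms, 1e-12)

    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Positives: other pixels that share the anchor's label.
    same = (mask[:, None] == mask[None, :]) & not_self

    # Log-softmax of each anchor's similarities over all *other* pixels.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Average log-probability of the positives, per anchor that has any.
    pos_count = same.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(same, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Four toy pixel embeddings: two point one way, two point another.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_separated = pixel_contrastive_loss(feats, [1, 1, 0, 0])  # labels match clusters
loss_mixed = pixel_contrastive_loss(feats, [1, 0, 1, 0])      # labels cut across clusters
```

In this toy case the loss is small when same-label pixels have similar embeddings and large when the labels cut across the clusters, which is the behaviour a pixel-level contrastive objective rewards during training.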