Abstract: Predicting salient regions in images requires capturing contextual information in the scene. Conventional saliency models typically rely on an encoder–decoder architecture and multiscale feature fusion to model contextual features, which, however, incur a large computational cost and many model parameters. In this article, we address the saliency prediction task by capturing long-range dependencies with the self-attention mechanism. Self-attention has been widely used in image recognition and other classification tasks, but it has rarely been considered for the regression-based saliency prediction task. Inspired by the nonlocal block, we propose a new saliency prediction network, named SalDA, in which a deep convolutional network is integrated with the attention mechanism. Since each feature map may capture different salient regions, our spatial attention module first adaptively aggregates the feature at each position as a weighted sum of the features at all positions within each individual channel. To capture interdependencies between channels, we also introduce a channel attention module that integrates features across different channels. We combine these two attention modules into a multiattention module to further improve the predicted saliency maps. We demonstrate the effectiveness of SalDA on SALICON, the largest saliency prediction data set. Compared with other state-of-the-art methods, SalDA yields comparable saliency prediction performance with substantially fewer model parameters and shorter inference time.
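To make the two attention mechanisms concrete, the sketch below gives a minimal PyTorch-style illustration in the spirit of the nonlocal/dual-attention formulation the abstract describes. It is not the authors' implementation: the module names, the 1x1-convolution reduction ratio, the learnable residual weight, and the summation-based fusion of the two branches are all assumptions made for illustration, and the spatial branch follows the standard shared-attention form rather than the strictly per-channel weighting mentioned in the abstract.

```python
# Illustrative sketch only (not the SalDA code): spatial and channel attention
# modules combined into a simple multiattention block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Re-weights each spatial position by a weighted sum over all positions."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight (assumed)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)      # (b, hw, c/8)
        k = self.key(x).flatten(2)                        # (b, c/8, hw)
        attn = F.softmax(q @ k, dim=-1)                   # (b, hw, hw) position affinities
        v = self.value(x).flatten(2)                      # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w) # weighted sum over all positions
        return self.gamma * out + x


class ChannelAttention(nn.Module):
    """Models interdependencies between channels via a channel affinity matrix."""

    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                  # (b, c, hw)
        attn = F.softmax(f @ f.transpose(1, 2), dim=-1)   # (b, c, c) channel affinities
        out = (attn @ f).view(b, c, h, w)                 # integrate features across channels
        return self.gamma * out + x


class MultiAttention(nn.Module):
    """Combines the two attention branches; summation fusion is an assumption."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention()

    def forward(self, x):
        return self.spatial(x) + self.channel(x)


if __name__ == "__main__":
    # Example: refine a backbone feature map before decoding it into a saliency map.
    feats = torch.randn(2, 64, 30, 40)
    refined = MultiAttention(64)(feats)
    print(refined.shape)  # torch.Size([2, 64, 30, 40])
```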