Exploring Language Prior for Mode-Sensitive Visual Attention Modeling

ACM Multimedia 2020 (modified: 17 Nov 2022)
Abstract: Modeling the human visual attention mechanism is a fundamental problem in the understanding of human vision, and it has also been demonstrated to be an important module for multimedia applications such as image captioning and visual question answering. In this paper, we propose a new probabilistic framework for attention and introduce the concept of a mode to model the flexibility and adaptability of attention modulation in complex environments. We characterize the correlations between the visual input, the activated mode, the saliency, and the spatial allocation of attention via a graphical model representation, based on which we exploit linguistic guidance from captioning data to implement a mode-sensitive attention (MSA) model. The proposed framework explicitly justifies the use of center bias for fixation prediction and can convert an arbitrary learning-based backbone attention model into a more robust multi-mode version. Experimental results on the York120, MIT1003 and PASCAL datasets demonstrate the effectiveness of the proposed method.
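The abstract does not give implementation details, but the multi-mode idea it describes can be illustrated with a minimal sketch: a latent mode distribution is inferred from the visual features, each mode produces its own saliency map, and the prediction is the mode-weighted mixture blended with a center-bias prior. All names below (ModeSensitiveAttention, mode_head, num_modes, the Gaussian center bias) are hypothetical illustrations under these assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModeSensitiveAttention(nn.Module):
    """Hypothetical sketch of a mode-sensitive attention wrapper.

    Wraps an arbitrary backbone feature extractor: a small classifier
    infers a distribution over K latent modes from pooled features,
    each mode has its own saliency head, and the output is the
    mode-weighted mixture plus a learned-weight center-bias prior.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, num_modes: int = 4):
        super().__init__()
        self.backbone = backbone  # any model mapping images to (B, C, H, W) features
        self.num_modes = num_modes
        self.mode_head = nn.Linear(feat_dim, num_modes)
        # one 1x1-conv saliency head per mode
        self.saliency_heads = nn.ModuleList(
            [nn.Conv2d(feat_dim, 1, kernel_size=1) for _ in range(num_modes)]
        )
        self.center_bias_weight = nn.Parameter(torch.tensor(0.5))

    @staticmethod
    def center_bias(h: int, w: int, device) -> torch.Tensor:
        # Gaussian centered on the image, a common fixation prior.
        ys = torch.linspace(-1, 1, h, device=device).view(h, 1)
        xs = torch.linspace(-1, 1, w, device=device).view(1, w)
        return torch.exp(-(xs ** 2 + ys ** 2) / 0.5)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                        # (B, C, H, W)
        pooled = feats.mean(dim=(2, 3))                      # (B, C)
        mode_probs = F.softmax(self.mode_head(pooled), -1)   # (B, K)
        # per-mode saliency maps, stacked along the mode axis: (B, K, H, W)
        maps = torch.cat([head(feats) for head in self.saliency_heads], dim=1)
        # expected saliency under the inferred mode distribution
        saliency = (mode_probs[:, :, None, None] * maps).sum(dim=1)  # (B, H, W)
        bias = self.center_bias(*saliency.shape[-2:], saliency.device)
        return saliency + self.center_bias_weight * bias
```

Wrapping an existing backbone this way mirrors the abstract's claim that an arbitrary learning-based attention model can be converted into a multi-mode version: the mixture over per-mode heads is what makes the prediction mode-sensitive, and the additive Gaussian prior reflects the explicit role of center bias in fixation prediction.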