Abstract: The goal of image captioning via machine learning is to automatically produce a free-form description of an image that focuses on the image's significant objects. Inspired by recent work on attention in image captioning, we study different attention mechanisms within a deep learning setting. In contrast to previous research on attention models, which focuses on applying attention to the image modality, we introduce three language-based attention models. These models, which we developed iteratively from simpler RNN- and LSTM-based baseline models, consist of two sub-networks: a deep recurrent neural network for the language modality and a convolutional neural network for the image modality. The language-based attention models learn a joint representation of the language and image modalities, given the image and the previous words in the caption. At test time, novel captions are produced from this learned distribution. We provide a comparative quantitative and qualitative analysis of our three language-based attention models, which outperform the simple baseline models. We validate the effectiveness of our attention models with state-of-the-art performance on the Flickr8k dataset.
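The abstract does not specify how attention over the language modality is computed, so the following is only a minimal illustrative sketch of the general idea, not the authors' models. It assumes dot-product scoring with a softmax over the previous word vectors and a joint representation formed by simple concatenation with the image features; the names `language_attention`, `joint_representation`, and `query` (a stand-in for a recurrent hidden state) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def language_attention(query, prev_word_vecs):
    """Attend over previous word vectors (assumed dot-product scoring).

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [dot(query, w) for w in prev_word_vecs]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, prev_word_vecs))
        for i in range(dim)
    ]
    return weights, context


def joint_representation(image_features, context):
    """Assumption: combine modalities by plain concatenation."""
    return list(image_features) + list(context)


# Toy inputs: two previous words with 2-d embeddings, 3-d image features.
query = [1.0, 0.0]
prev_words = [[1.0, 0.0], [0.0, 1.0]]
image_features = [0.5, 0.2, 0.1]

weights, context = language_attention(query, prev_words)
joint = joint_representation(image_features, context)
```

In this toy setting the first previous word aligns with the query and therefore receives the larger attention weight; a trained model would learn the embeddings and the scoring function rather than use fixed vectors.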
External IDs: dblp:conf/dsaa/RajendraRMZH18