Abstract: Recently, automatic generation of image captions has attracted great interest not only because of its extensive applications but also because it connects computer vision and natural language processing. By combining convolutional neural networks (CNNs), which learn visual representations from images, and recurrent neural networks (RNNs), which translate the learned features into text sequences, the content of a image can be transformed into linguistic sequences. Existing approaches typically focus on visual features extracted form an object-oriented CNN (train on ImageNet) and then decode them into natural language. In this paper, we propose a novel model using not only object-related, but also scene-related information extracted from the images. To make full use of both object and scene information, we first combine object information and scene information (extracted from a scene-oriented CNN), and then using as inputs to RNNs. Both types of information provide complementary aspects that help in generating a more complete description of the image. Qualitative and quantitative evaluation results validate the effectiveness of our method.
0 Replies
Loading