Long-Term Recurrent Merge Network Model for Image Captioning

Published: 2018, Last Modified: 18 Dec 2024, Venue: ICTAI 2018, License: CC BY-SA 4.0
Abstract: Language models based on recurrent neural networks, e.g. the Long Short-Term Memory (LSTM) network, have shown a strong ability to generate captions from images. However, previous LSTM-based image captioning models feed the image information to the LSTM only at the 0th time step. The network gradually forgets this information and relies on the language model alone to generate a simple description, leaving room for a better one. To address this challenge, this paper proposes a Long-term Recurrent Merge Network (LRMN) model that merges the image feature at each step via a language model, which improves the accuracy of image captioning and also describes the image better. Experimental results show that the proposed LRMN model achieves a promising improvement in image captioning.
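The contrast the abstract draws, image information supplied only at the 0th time step versus merged at every step, can be illustrated with a minimal sketch. This is not the LRMN architecture from the paper: a plain tanh recurrent cell stands in for the LSTM, the merge is assumed to be concatenation followed by a linear projection, and all names (`init_inject`, `merge_each_step`, `W_x`, `W_h`, `W_m`) and the random weights are hypothetical, chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def rnn_step(W_x, W_h, x, h):
    """One step of a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
    return [math.tanh(p) for p in pre]


def init_inject(image_feat, word_embs, W_x, W_h):
    """Baseline: the image feature is fed to the recurrent cell at step 0 only."""
    h = [0.0] * len(W_h)
    h = rnn_step(W_x, W_h, image_feat, h)  # the only place the image enters
    states = []
    for w in word_embs:
        h = rnn_step(W_x, W_h, w, h)
        states.append(h)
    return states


def merge_each_step(image_feat, word_embs, W_x, W_h, W_m):
    """Merge variant: the image feature is combined with the state at every step.

    Assumption: the merge is concatenation plus a projection through W_m.
    """
    h = [0.0] * len(W_h)
    states = []
    for w in word_embs:
        h = rnn_step(W_x, W_h, w, h)  # language-only recurrence
        merged = [math.tanh(v) for v in matvec(W_m, h + image_feat)]
        states.append(merged)  # this vector would feed the word classifier
    return states


# Toy usage with made-up dimensions and inputs.
rng = random.Random(0)
dim, hidden = 4, 3
W_x = rand_matrix(hidden, dim, rng)
W_h = rand_matrix(hidden, hidden, rng)
W_m = rand_matrix(hidden, hidden + dim, rng)
words = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
image = [1.0] * dim

baseline_states = init_inject(image, words, W_x, W_h)
merged_states = merge_each_step(image, words, W_x, W_h, W_m)
```

In the baseline, the image can influence later steps only through the recurrent state, so its contribution can fade as the caption grows. In the merge variant, the image feature enters each step's output directly, regardless of caption length.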