Deep sequential fusion LSTM network for image description

Pengjie Tang, Hanli Wang, Sam Kwong

Published: 2018, Last Modified: 11 Apr 2025. Neurocomputing 2018. CC BY-SA 4.0

Abstract: Automatic image description is a challenging task that aims to translate the visual information in an image into natural language that conforms to proper grammar and sentence structure. In this work, an optimal learning framework called the deep sequential fusion based long short-term memory network is designed. In the proposed framework, a layer-wise strategy is introduced into the generation process of the recurrent neural network to increase the depth of the language model and produce more abstract and discriminative features. Then, a deep supervision method is developed to enrich the model capacity with extra regularization. Moreover, the prediction scores from all of the auxiliary branches in the language model are fused with the product rule to form the final decision output, which makes further use of the optimized model parameters and hence boosts performance. Experimental results on two public benchmark datasets verify the effectiveness of the proposed approaches: the consensus-based image description evaluation metric (CIDEr) reaches 103.4 on the MSCOCO dataset, and the metric for evaluation of translation with explicit ordering (METEOR) reaches 20.6 on the Flickr30K dataset.
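
The product-rule fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: each auxiliary branch is taken to output a probability distribution over the same vocabulary at a given time step, the fused score of a word is the product of its per-branch probabilities (computed as a sum of logs for numerical stability), and the result is renormalized. The names `product_rule_fusion`, `predict_word`, and `eps` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Normalize a list of real-valued scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def product_rule_fusion(branch_probs, eps=1e-12):
    """Fuse per-branch word distributions with the product rule.

    branch_probs: list of distributions, one per branch, all over the same
    vocabulary (assumed layout, not taken from the paper).
    eps: small constant (an assumption) that guards against log(0).
    """
    vocab_size = len(branch_probs[0])
    if any(len(p) != vocab_size for p in branch_probs):
        raise ValueError("all branches must score the same vocabulary")
    # Product of probabilities == exp(sum of log-probabilities).
    log_scores = [
        sum(math.log(p[i] + eps) for p in branch_probs)
        for i in range(vocab_size)
    ]
    # Renormalize so the fused scores again sum to 1.
    return softmax(log_scores)


def predict_word(branch_probs):
    """Return the vocabulary index with the highest fused score."""
    fused = product_rule_fusion(branch_probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, with two branches scoring a three-word vocabulary as `[0.6, 0.3, 0.1]` and `[0.2, 0.5, 0.3]`, the unnormalized products are 0.12, 0.15, and 0.03, so `predict_word` selects index 1 even though the first branch alone would prefer index 0. A word must score reasonably well in every branch to win under this rule, which is one plausible reason to prefer it over simple averaging.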