Abstract: Visual attention is widely applied to image captioning. Previous works feed visual attention and linguistic words together into a long short-term memory (LSTM) network, but neglect the sequential relation between the attention at different time steps during word prediction. Moreover, the degree of abstraction of visual attention usually differs from that of linguistic words. To address these issues, this work proposes a sequential attention model that handles visual attention by taking this sequential relation into account, so that the internal relation among the attention at each word prediction step is exploited to enhance the visual information during sentence decoding. Experimental results on the benchmark MSCOCO and Flickr30K datasets show that the proposed model performs strongly, achieving a CIDEr score of 108.1 and a BLEU-4 score of 34.9 on MSCOCO.