Abstract: Recent neural image captioning models typically follow an encoder-decoder paradigm, in which a pretrained CNN encoder provides visual information and the decoder predicts one word at each decoding step. The decoder, its input, and its output are therefore the most important parts of such a model. We propose a pipelined image captioning framework consisting of two cascaded agents. The first, the "semantic adaptive agent," generates the input to the decoder by consulting information from the current decoding process; the second, the "caption generating agent," selects a single word from the vocabulary as the decoder's output by taking into account both that input and the decoder's current states. For this two-agent framework, we design a multi-stage training procedure that trains the two agents with different objectives, making full use of reinforcement learning. In experiments, we conduct quantitative and qualitative analyses on the MS COCO dataset, and our results significantly outperform baseline methods on several evaluation metrics.
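The cascaded decoding loop described in the abstract can be sketched in plain Python. Everything below is an illustrative assumption rather than the paper's actual architecture: the agent functions, the toy scoring rule, the vocabulary, and the state update are hypothetical stand-ins that only show how the two agents would alternate at each decoding step.

```python
# Hypothetical sketch of two cascaded agents producing one word per step.
# All function bodies are toy stand-ins, not the paper's learned modules.

VOCAB = ["a", "dog", "runs", "<eos>"]

def semantic_adaptive_agent(image_feats, decoder_state):
    """First agent (assumed): builds the decoder input by combining image
    features with the current decoder state; here a simple element-wise
    average stands in for a learned, attention-like mechanism."""
    return [(f + s) / 2.0 for f, s in zip(image_feats, decoder_state)]

def caption_generating_agent(decoder_input, decoder_state):
    """Second agent (assumed): scores every vocabulary word from the
    decoder input and state, then returns the highest-scoring word."""
    scores = []
    for idx in range(len(VOCAB)):
        # Toy score: weight the input by a state vector shifted per word index.
        scores.append(sum(x * (s + idx)
                          for x, s in zip(decoder_input, decoder_state)))
    return VOCAB[max(range(len(VOCAB)), key=scores.__getitem__)]

def decode(image_feats, max_len=4):
    """Run the cascade: agent 1 prepares the decoder input, agent 2 emits
    a word, and a toy state update closes the loop."""
    state = [0.1] * len(image_feats)  # toy initial decoder state
    caption = []
    for _ in range(max_len):
        dec_in = semantic_adaptive_agent(image_feats, state)
        word = caption_generating_agent(dec_in, state)
        caption.append(word)
        if word == "<eos>":
            break
        state = dec_in  # toy state transition
    return caption

caption = decode([0.5, 0.2, 0.9])
```

In the paper's setting each agent would be a trained network optimized with its own reinforcement-learning objective; this sketch only fixes the control flow of the cascade.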