Abstract: In this paper, we propose a novel training method for transformer encoder-decoder based image captioning, which directly generates a caption text from an input image. In general, a large number of image-to-text paired examples must be prepared for robust image captioning, but such datasets are difficult to collect in practical cases. Our key idea for mitigating the data preparation cost is to utilize text-to-text paraphrasing modeling, i.e., the task of converting an input text into a different expression without changing its meaning. In fact, paraphrasing involves a transformation task similar to image captioning, even though it handles texts instead of images. In our proposed method, an encoder-decoder network trained via the paraphrasing task is directly leveraged for image captioning. Thus, an encoder-decoder network pre-trained on a text-to-text transformation task is transferred to an image-to-text transformation task, even though the encoder network must handle a different modality. Our experiments using the MS COCO caption dataset demonstrate the effectiveness of the proposed method.
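To make the transfer idea concrete, the following is a minimal sketch, not the authors' implementation: a transformer encoder-decoder is first trained on text-to-text paraphrasing with token embeddings, and the same encoder-decoder is then reused for image-to-text captioning by projecting image features into the encoder's input space. Names such as `ParaphraseCaptioner`, `img_proj`, and the dimension values are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-stage idea described in the abstract:
# Stage 1 trains the encoder-decoder on text-to-text paraphrasing;
# Stage 2 reuses the same encoder-decoder for image-to-text captioning,
# changing only how the encoder input is produced.
class ParaphraseCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, img_feat_dim=2048):
        super().__init__()
        # Token embedding used during paraphrasing pre-training.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Projects image features into the same d_model space (assumed component).
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward_paraphrase(self, src_tokens, tgt_tokens):
        # Stage 1: text-to-text paraphrasing objective.
        src = self.token_embed(src_tokens)
        tgt = self.token_embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.lm_head(out)

    def forward_caption(self, img_feats, tgt_tokens):
        # Stage 2: image-to-text captioning with the pre-trained
        # encoder-decoder; only the input modality changes.
        src = self.img_proj(img_feats)
        tgt = self.token_embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.lm_head(out)
```

Under these assumptions, the paraphrasing stage would supply the initialization of the transformer and output head, and the captioning stage would fine-tune them together with the newly introduced image projection.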