Abstract: Transformer-based image captioning models have shown remarkable performance owing to their powerful sequence modeling capability. However, most of them focus only on learning deterministic mappings from image space to caption space, i.e., learning how to improve the accuracy of predicting "average" captions, which generally leads to common words, repeated phrases, and monotonous sentences. In this paper, we propose a novel multi-feature fusion based sequential variational transformer for diverse image captioning (MF-SVT-DIC), aiming to learn one-to-many projections and improve the diversity and accuracy of generated captions simultaneously. Specifically, we incorporate sequential variational inference into the traditional transformer-based captioning model to model word-level diversity. Meanwhile, we design a fusion module to take advantage of both grid features and region features, facilitating the generation of fine-grained captions. The experimental results demonstrate that our method achieves significant gains in terms of both diversity and accuracy compared with state-of-the-art diverse image captioning models.
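To make the two components named in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: a fusion module that merges grid and region features, and a per-step Gaussian latent variable (reparameterization trick) that can be injected into a transformer decoder to model word-level diversity. All module names, gating design, and dimensions (d_model=512, d_latent=64, 7x7 grid, 36 regions) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Fuse grid features and region features with a learned gate (assumed design)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.grid_proj = nn.Linear(d_model, d_model)
        self.region_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, grid_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (B, N_grid, d), region_feats: (B, N_region, d)
        fused = torch.cat(
            [self.grid_proj(grid_feats), self.region_proj(region_feats)], dim=1
        )
        # Pool both streams, compute a sigmoid gate, and rescale the fused tokens.
        pooled = torch.cat([grid_feats.mean(1), region_feats.mean(1)], dim=-1)  # (B, 2d)
        g = torch.sigmoid(self.gate(pooled)).unsqueeze(1)                       # (B, 1, d)
        return fused * g                                              # (B, N_grid+N_region, d)


class StepwiseLatent(nn.Module):
    """Sample a Gaussian latent z_t at each decoding step via the reparameterization trick."""

    def __init__(self, d_model: int = 512, d_latent: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.back = nn.Linear(d_latent, d_model)

    def forward(self, h_t: torch.Tensor):
        mu, logvar = self.to_mu(h_t), self.to_logvar(h_t)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # z_t ~ N(mu, sigma^2)
        # The projected z_t would be added to the decoder state before word prediction.
        return self.back(z), mu, logvar


if __name__ == "__main__":
    B, d = 2, 512
    fusion, latent = FeatureFusion(d), StepwiseLatent(d)
    grid = torch.randn(B, 49, d)     # e.g. 7x7 grid features (assumed shape)
    regions = torch.randn(B, 36, d)  # e.g. 36 detected region features (assumed shape)
    memory = fusion(grid, regions)   # fused encoder memory for a transformer decoder
    h_t = torch.randn(B, 1, d)       # one decoder hidden state
    z_proj, mu, logvar = latent(h_t)
    print(memory.shape, z_proj.shape)
```

In a full model, the KL divergence between the per-step posterior and a prior would be added to the captioning loss, and sampling different z_t at inference time is what yields diverse captions for the same image.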