Abstract: Image captioning is a typical task in multimodal learning. Many existing image captioning models rely on the autoregressive paradigm, which causes notable inference delays and limits practical applications. Non-autoregressive methods effectively address this delay, but a performance gap remains compared with autoregressive models. In this paper, we introduce a dual-branch non-autoregressive image captioning model that significantly enhances performance. First, we leverage both region and grid features to fully exploit the fine-grained content of the image; to avoid increasing inference delay, we design a dual-branch network that handles these two feature types separately. Second, we design a word retrieval module to enrich the semantics of the inputs to the non-autoregressive decoder. In addition, our approach incorporates multiple teacher models in the knowledge distillation process, preserving the diversity of our model by avoiding reliance on a single autoregressive teacher. Experiments on the MSCOCO dataset show that our dual-branch non-autoregressive image captioning model achieves new state-of-the-art performance, reaching a 128.8% CIDEr score on the ‘Karpathy’ offline test split while delivering a \(17\times \) inference speedup.
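The abstract does not spell out the multi-teacher distillation objective; the following is a minimal, hypothetical sketch of one common way sequence-level knowledge distillation from several autoregressive teachers could be combined for a non-autoregressive student. The function name, tensor shapes, and the simple averaging over teachers are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_caption_ids, pad_id=0):
    """Hypothetical sequence-level KD objective with several teachers.

    student_logits:      (B, T, V) outputs of the non-autoregressive decoder
    teacher_caption_ids: list of K tensors, each (B, T), captions generated
                         by K different autoregressive teacher models
    """
    losses = []
    for captions in teacher_caption_ids:
        # Token-level cross-entropy against each teacher's generated caption,
        # ignoring padding positions.
        loss = F.cross_entropy(
            student_logits.reshape(-1, student_logits.size(-1)),
            captions.reshape(-1),
            ignore_index=pad_id,
        )
        losses.append(loss)
    # Averaging over several teachers keeps the student from collapsing onto
    # the output distribution of a single autoregressive model.
    return torch.stack(losses).mean()
```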