Exploring Non-Autoregressive Image Captioning: Patterns and Semantics

ACL ARR 2024 June Submission 3714 Authors

16 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: In image captioning (IC), learning sentence patterns and semantics plays a crucial role. This aspect has received little attention because prevailing IC models follow the autoregressive IC (AR-IC) paradigm, which generates captions word by word. In that paradigm, coherence and fluency with the preceding text are prioritized during word generation, with no special consideration of sentence patterns. While effective, AR-IC approaches are ill-suited to real-time applications because their inference is time-consuming. Unlike their AR-IC counterparts, non-autoregressive IC (NAR-IC) models infer all words of a caption simultaneously. However, existing NAR-IC models remain less effective than their autoregressive counterparts, largely because they follow the AR-IC recipe and neglect the influence of patterns and semantics on NAR-IC. Since the dependency on preceding and following words is eliminated during NAR-IC generation, sentence patterns become crucial for guiding word generation. In this paper, we reconsider the impact of sentence patterns and semantics in NAR-IC training. We delve into NAR-IC and provide tips and tricks for training NAR-IC models, including knowledge distillation, label selection, image pre-fusion, and NAR+AR enhancement. By carefully examining the impact of each component on model performance, we achieve state-of-the-art performance with single-step generation. This paper aims to provide valuable strategies for those seeking to advance NAR-IC models. Our code will be provided in the supplementary materials.
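The paradigm contrast at the heart of the abstract can be sketched in toy form. This is not the authors' code; the stand-in `ar_step` and `nar_step` "models" are hypothetical placeholders that only illustrate the decoding loops: AR decoding makes one sequential model call per word, while NAR decoding produces every position in a single parallel call.

```python
# Illustrative contrast (hypothetical toy models, not the paper's system):
# AR-IC predicts one word at a time conditioned on the prefix;
# NAR-IC predicts all positions in a single step.

VOCAB = ["a", "dog", "runs", "on", "grass", "<eos>"]

def ar_step(image_feat, prefix):
    """Hypothetical AR model: next word is a function of the prefix so far."""
    return VOCAB[min(len(prefix), len(VOCAB) - 1)]

def nar_step(image_feat, length):
    """Hypothetical NAR model: every position is predicted independently, at once."""
    return [VOCAB[min(i, len(VOCAB) - 1)] for i in range(length)]

def ar_decode(image_feat, max_len=6):
    caption = []
    for _ in range(max_len):                 # O(length) sequential model calls
        word = ar_step(image_feat, caption)
        if word == "<eos>":
            break
        caption.append(word)
    return caption

def nar_decode(image_feat, length=5):
    return nar_step(image_feat, length)      # one parallel model call

print(ar_decode(None))   # sequential: ['a', 'dog', 'runs', 'on', 'grass']
print(nar_decode(None))  # parallel:   ['a', 'dog', 'runs', 'on', 'grass']
```

Because the NAR loop body runs once regardless of caption length, inference latency no longer grows with the number of words, which is the efficiency motivation the abstract cites; the cost is that each position is predicted without seeing its neighbors, which is why the paper argues sentence patterns must guide generation.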
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation; image text matching; cross-modal pretraining
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3714