Abstract: In image captioning (IC), learning sentence patterns and semantics plays a crucial role.
This aspect has received little attention because prevailing IC models follow the autoregressive IC (AR-IC) paradigm, which generates captions word by word.
In this paradigm, coherence and fluency with the preceding text are prioritized during word generation, with no special consideration given to sentence patterns.
While effective, AR-IC approaches are ill-suited to real-time applications due to their slow inference.
Unlike their AR-IC counterparts, non-autoregressive IC (NAR-IC) models infer all words of a caption simultaneously.
However, existing NAR-IC models remain less effective than their autoregressive counterparts, largely because they follow the AR-IC recipe and neglect the influence of sentence patterns and semantics on NAR-IC.
Since the dependency on preceding and following words is eliminated in NAR-IC generation, sentence patterns become crucial for guiding word generation.
In this paper, we revisit the impact of sentence patterns and semantics in NAR-IC training.
We delve into NAR-IC and provide tips and tricks for training NAR-IC models, including knowledge distillation, label selection, image pre-fusion, and NAR+AR enhancement. By carefully examining the impact of each component on model performance, we achieve state-of-the-art results with single-step generation. This paper aims to provide valuable strategies for those seeking to advance NAR-IC models. Our code is provided in the supplementary materials.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation; image text matching; cross-modal pretraining
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3714