Non-Autoregressive Image Captioning with Multi-Label Classification and Self-Critical Sequence Training

Published: 2025, Last Modified: 21 Jan 2026, Venue: ICASSP 2025, License: CC BY-SA 4.0
Abstract: Most current image captioning models rely on the autoregressive approach, which results in significant inference delays that hinder their practical use. In contrast, non-autoregressive methods show promising potential for increasing inference speed. However, there is often a performance gap between non-autoregressive and autoregressive image captioning models due to issues such as word repetition and semantic inconsistencies in the generated captions. Autoregressive models benefit from self-critical sequence training, which helps produce more coherent and fluent captions. Non-autoregressive models, by contrast, struggle to benefit from it because they predict words independently. In this paper, we introduce a two-stage training strategy designed to harness self-critical sequence training for enhancing the non-autoregressive image captioning model. In the first stage, our approach treats the image captioning task as a multi-label classification problem, which allows for the stable production of multiple candidate captions. In the second stage, we employ these candidate captions to compute sequence-level evaluation metric scores that serve as reward scores for self-critical sequence training. Extensive experiments demonstrate the effectiveness of our proposed method and show that our model achieves new state-of-the-art performance in inference accuracy and speed.
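To make the second stage concrete, the following is a minimal sketch of a self-critical loss computed from several candidate captions sampled from position-wise independent word distributions. It is an illustration under stated assumptions, not the paper's implementation: the function names (`sample_caption`, `toy_reward`, `scst_loss`) are hypothetical, the reward is a toy word-overlap score standing in for a sequence-level metric, and the baseline is taken to be the mean reward over the candidates, which the abstract does not specify.

```python
import math
import random


def sample_caption(probs, rng):
    """Sample one word index per position, independently of other positions.

    probs: list of per-position distributions over the vocabulary.
    Returns the sampled word indices and the summed log-probability.
    """
    words, log_prob = [], 0.0
    for dist in probs:
        idx = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        words.append(idx)
        log_prob += math.log(dist[idx])
    return words, log_prob


def toy_reward(candidate, reference):
    """Hypothetical stand-in for a sequence-level metric.

    Returns the fraction of distinct reference words that appear in the
    candidate; a real system would use a captioning metric instead.
    """
    ref_words = set(reference)
    return len(set(candidate) & ref_words) / len(ref_words)


def scst_loss(probs, reference, num_candidates=5, seed=0):
    """Self-critical loss with the mean candidate reward as baseline (assumed)."""
    rng = random.Random(seed)
    samples = [sample_caption(probs, rng) for _ in range(num_candidates)]
    rewards = [toy_reward(words, reference) for words, _ in samples]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Candidates scoring above the baseline get their log-probability pushed up,
    # those below get it pushed down.
    loss = -sum(a * lp for a, (_, lp) in zip(advantages, samples)) / num_candidates
    return loss, rewards, advantages


# Toy setup: 3 positions, vocabulary of 4 word indices.
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
reference = [0, 1, 2]
loss, rewards, advantages = scst_loss(probs, reference)
```

In an actual training loop the log-probabilities would come from a differentiable model, and the loss would be backpropagated through them; the pure-Python version here only shows how the candidate rewards and the baseline combine.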