Abstract: Image captioning is a bimodal task at the intersection of computer vision and natural language processing, in which a model generates a textual caption for a given input image. Traditional Transformer architectures built on an image encoder and a language decoder have shown promising results in the image captioning domain. However, two challenges remain: heavy parameter counts and additional data preprocessing. In this paper, we propose a lightweight CLIP-based early fusion transformer (BCEFT) to tackle these challenges. BCEFT uses CLIP as the encoder for both images and text, and adds a multi-modal fusion model to generate image captions. Specifically, the multi-modal fusion model comprises a multi-modal fusion attention module, which reduces computational complexity by more than half. Finally, after cross-entropy training, we train the model with reinforcement learning using the beam search algorithm. Our approach requires only relatively short training to produce a high-quality captioning model. Without requiring additional annotations or pre-training, it can effectively generate meaningful captions for large-scale and diverse datasets. Experimental results on the MSCOCO dataset demonstrate the superiority of our model. Meanwhile, our model achieves significant efficiency gains, including a nearly 50% reduction in model parameters and an eight-fold improvement in runtime speed.
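As a rough illustration of the overall pipeline only (not the authors' released code), the sketch below shows how a frozen CLIP image embedding can be fused with caption tokens through a small transformer decoder that predicts the next caption token. All names (`FusionCaptioner`, `img_proj`), layer counts, and dimensions are illustrative assumptions; the multi-modal fusion attention module proposed in the paper differs in its internal design.

```python
import torch
import torch.nn as nn

class FusionCaptioner(nn.Module):
    """Toy CLIP-prefix captioner: a frozen CLIP image feature is fused with
    caption tokens via cross-attention in a small transformer decoder."""
    def __init__(self, clip_dim=512, d_model=512, vocab_size=49408,
                 n_layers=4, n_heads=8, max_len=40):
        super().__init__()
        self.img_proj = nn.Linear(clip_dim, d_model)        # project the CLIP image embedding
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # caption token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positional embeddings
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerDecoder(layer, n_layers)  # stand-in for the fusion attention module
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_img_feat, caption_ids):
        # clip_img_feat: (B, clip_dim) output of CLIP's (frozen) image encoder
        # caption_ids:   (B, T) tokenized caption prefix
        _, T = caption_ids.shape
        memory = self.img_proj(clip_img_feat).unsqueeze(1)       # (B, 1, d_model) visual "memory"
        pos = torch.arange(T, device=caption_ids.device)
        tgt = self.tok_emb(caption_ids) + self.pos_emb(pos)      # (B, T, d_model)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=caption_ids.device), diagonal=1)
        h = self.fusion(tgt, memory, tgt_mask=causal)            # text attends to the image feature
        return self.head(h)                                      # (B, T, vocab) next-token logits
```

In a setup of this kind, the decoder would first be trained with token-level cross-entropy on image-caption pairs and then fine-tuned with a reinforcement-learning objective on the generated (e.g., beam-searched) captions, mirroring the two-stage training described in the abstract.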