Abstract: High-quality multimodal training data is critical for improving multimodal model performance. However, the use of web-crawled vision-caption pairs is hindered by noise and irrelevance, as well as a scarcity of Chinese data. Large Language Models (LLMs) and Large Multimodal Models (LMMs) have demonstrated promising performance in cross-modal understanding and generation. In light of this, we propose a Chinese visual captioning pipeline for synthesizing high-quality data. Our pipeline comprises two phases: first, training an encoder for visual understanding; and second, fine-tuning a captioning model in a two-stage iterative human-in-the-loop process, where the captioning model connects the pre-trained vision encoder and the LLM through a visual cross-attention querying transformer. Extensive experiments validate our framework, including both quantitative and qualitative evaluations of captions generated from images and videos. The synthesis pipeline has been integrated into the ad image creative generation process in Baidu Search Ads, resulting in improved prompt following.
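To make the bridging component concrete, the following is a minimal sketch, not the authors' implementation, of how a small set of learnable query tokens can connect a frozen vision encoder to an LLM via cross-attention, in the spirit of the querying transformer mentioned above. All dimensions, depths, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualQueryingBlock(nn.Module):
    """One block: queries self-attend, then cross-attend to visual features."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, visual_feats):
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, visual_feats, visual_feats, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))

class VisualQueryingTransformer(nn.Module):
    """Maps vision-encoder patch features to a fixed number of LLM input embeddings.
    Hyperparameters below (dims, 32 queries, 6 blocks) are assumptions for illustration."""
    def __init__(self, vis_dim=1024, dim=768, llm_dim=4096, num_queries=32, depth=6, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, dim)    # project vision features into query space
        self.blocks = nn.ModuleList(VisualQueryingBlock(dim, num_heads) for _ in range(depth))
        self.llm_proj = nn.Linear(dim, llm_dim)    # map queries to the LLM embedding size

    def forward(self, visual_feats):               # visual_feats: (B, num_patches, vis_dim)
        v = self.vis_proj(visual_feats)
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, v)
        return self.llm_proj(q)                    # (B, num_queries, llm_dim)
```

In such a design, only the querying transformer (and optionally the projection layers) would be trained during caption fine-tuning, while the outputs are prepended to the text token embeddings fed into the LLM; the actual training recipe used in this work is described in the following sections.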