Abstract: High-quality multimodal training data is critical for improving multimodal model performance. However, the use of web-crawled vision-caption pairs is hindered by noise and irrelevance, as well as a scarcity of Chinese data. Large Language Models (LLMs) and Large Multimodal Models (LMMs) have demonstrated promising performance in cross-modal understanding and generation. In light of this, we propose a Chinese visual captioning pipeline for synthesizing high-quality data. Our pipeline comprises two phases: first, training an encoder for visual understanding; and second, fine-tuning a captioning model in a two-stage iterative human-in-the-loop process, where the captioning model connects the pre-trained vision encoder and the LLM through a visual cross-attention querying transformer. Extensive experiments validate our framework, including both quantitative and qualitative evaluations of captions generated from images and videos. The synthesis pipeline has been integrated into the ad image creative generation process in Baidu Search Ads, resulting in improved prompt following.
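To make the bridging component concrete, the following is a minimal sketch, not the authors' implementation, of how a small set of learnable query tokens can connect a frozen vision encoder to an LLM via cross-attention, in the spirit of the querying transformer mentioned above. All dimensions, depths, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualQueryingBlock(nn.Module):
    """One block: queries self-attend, then cross-attend to visual features."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, visual_feats):
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, visual_feats, visual_feats, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))

class VisualQueryingTransformer(nn.Module):
    """Maps vision-encoder patch features to a fixed number of LLM input embeddings.
    Hyperparameters below (dims, 32 queries, 6 blocks) are assumptions for illustration."""
    def __init__(self, vis_dim=1024, dim=768, llm_dim=4096, num_queries=32, depth=6, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, dim)    # project vision features into query space
        self.blocks = nn.ModuleList(VisualQueryingBlock(dim, num_heads) for _ in range(depth))
        self.llm_proj = nn.Linear(dim, llm_dim)    # map queries to the LLM embedding size

    def forward(self, visual_feats):               # visual_feats: (B, num_patches, vis_dim)
        v = self.vis_proj(visual_feats)
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, v)
        return self.llm_proj(q)                    # (B, num_queries, llm_dim)
```

In such a design, only the querying transformer (and optionally the projection layers) would be trained during caption fine-tuning, while the outputs are prepended to the text token embeddings fed into the LLM; the actual training recipe used in this work is described in the following sections.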