Abstract: Dynamic image advertising is an add-on service in search advertising that matches visuals to search ads in real time. However, the image matching system comprises multiple sub-tasks with distinct objectives, which complicates global optimization. In addition, the prevalence of long-tailed data poses a challenge to multimodal representation learning in dynamic image advertising. Recently, vision-language pre-trained models have achieved remarkable performance across a variety of multimodal tasks and have been adopted as foundational representation models in e-commerce scenarios. In this paper, to improve multimodal content understanding in Dynamic Image adVERtising, we present a viSion-language rEpresentation model (referred to as DIVERSE) that is trained with cross-view and cross-token contrastive losses. Moreover, with large-scale curated advertising image-text data and a range of efficient training techniques, we scale DIVERSE to 12 billion parameters, making it the largest Chinese multimodal representation model in industrial practice. Experimental results demonstrate the distinct advantages of DIVERSE12B on business datasets, along with competitive performance on public benchmarks. Further evaluation on downstream applications, including ad text-image retrieval, text-image relevance modeling, and image content moderation, shows that it outperforms previous separately trained models on both offline and online metrics. Finally, DIVERSE12B has been deployed on the primary traffic of Baidu Search Ads, bringing considerable gains in user experience and in revenue for both advertisers and the search engine.
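The abstract does not specify the exact form of the cross-view contrastive objective. As a point of reference, in comparable vision-language representation models (e.g., CLIP-style dual encoders) the cross-view term is a symmetric InfoNCE loss over paired image and text embeddings, with other pairs in the batch acting as negatives. The following PyTorch sketch illustrates that standard formulation under this assumption; the function name and tensor shapes are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings.

    Hypothetical sketch: `image_emb` and `text_emb` are (batch, dim)
    L2-normalized representations, where matching image-text pairs share
    the same row index and all other rows serve as in-batch negatives.
    """
    # Cosine-similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The correct match for row i is column i.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

A cross-token variant would apply an analogous contrast at the level of patch and word tokens rather than pooled global embeddings, but the paper's specific construction is not described in this section.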