Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising
Abstract: Visual content in multimedia advertising effectively attracts customers' attention. Search-based multimedia advertising is a cross-modal retrieval problem. Due to the modal gap between texts and images/videos, cross-modal image/video retrieval is challenging. Recently, multi-modal dictionary BERT has bridged the modal gap by unifying images/videos and texts from different modalities through a multi-modal dictionary. In this work, we improve multi-modal dictionary BERT by developing a multi-scale multi-modal dictionary and propose Multi-scale Multi-modal Dictionary BERT (M^2D-BERT). The multi-scale dictionary partitions the feature space at different levels and is effective in describing both the fine-level and the coarse-level relevance between texts and images. Meanwhile, we constrain the code-words in dictionaries from different scales to be orthogonal to each other, which ensures that the multiple dictionaries are complementary. Moreover, we adopt a two-level residual quantization to enhance the capacity of each multi-modal dictionary. Systematic experiments conducted on large-scale cross-modal retrieval datasets demonstrate the excellent performance of our M^2D-BERT.
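The abstract names two mechanisms, two-level residual quantization within each dictionary and an orthogonality constraint between code-words of different scales, without giving implementation details. The sketch below is therefore only an illustrative PyTorch-style reading, not the authors' code: the module name `M2DQuantizer`, the helper `cross_scale_orthogonality`, the parameters `num_codes` and `dim`, and the way the residual flows from the coarse scale to the fine scale are all assumptions introduced for illustration.

```python
# Minimal sketch (assumed design, not the paper's implementation) of
# (1) two-level residual quantization against one multi-modal dictionary, and
# (2) an orthogonality penalty between code-words of dictionaries at different scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class M2DQuantizer(nn.Module):
    """Quantizes a feature with two residual code-books at a single scale."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook1 = nn.Parameter(torch.randn(num_codes, dim))
        self.codebook2 = nn.Parameter(torch.randn(num_codes, dim))

    @staticmethod
    def _nearest(x: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        # Pick the nearest code-word (L2 distance) for each feature vector.
        idx = torch.cdist(x, codebook).argmin(dim=-1)
        return codebook[idx]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First-level quantization, then quantize the residual (two-level RQ).
        q1 = self._nearest(x, self.codebook1)
        q2 = self._nearest(x - q1, self.codebook2)
        return q1 + q2


def cross_scale_orthogonality(coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
    """Penalty pushing code-words of two scales toward mutual orthogonality."""
    c = F.normalize(coarse, dim=-1)
    f = F.normalize(fine, dim=-1)
    # Mean squared cosine similarity between every coarse/fine code-word pair.
    return (c @ f.t()).pow(2).mean()


# Toy usage: a coarse dictionary (few codes) and a fine dictionary (many codes).
coarse_q = M2DQuantizer(num_codes=64, dim=256)
fine_q = M2DQuantizer(num_codes=1024, dim=256)
feats = torch.randn(8, 256)                 # e.g. fused text/image features
coarse_out = coarse_q(feats)                # coarse-level quantization
fine_out = fine_q(feats - coarse_out)       # fine scale refines the coarse residual
quantized = coarse_out + fine_out
ortho_loss = cross_scale_orthogonality(coarse_q.codebook1, fine_q.codebook1)
```

In this reading, the orthogonality term would be added to the retrieval training loss so that the coarse and fine dictionaries capture complementary structure rather than duplicating each other.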