Abstract: Cross-modal retrieval has achieved significant progress in recent years with the help of token-embedding interaction methods. Most existing methods first extract an embedding for each token of the input image and text, then feed the token-level embeddings into a multi-modal transformer to learn a joint representation, which is used to predict a matching score between the input image and text. However, these methods do not explicitly supervise the alignment between visual and textual tokens. In this paper, we propose a novel Token Embeddings AlignMent (TEAM) block, which first explicitly aligns visual tokens with textual tokens and then produces token-level matching scores to measure the fine-grained similarity between the input image and text. TEAM achieves new state-of-the-art performance on commonly used cross-modal retrieval benchmarks. Moreover, TEAM is interpretable, and we provide visualization experiments to show how it works. Finally, we construct a new billion-scale Chinese vision-language pre-training dataset, the largest Chinese vision-language pre-training dataset to date. After pre-training on this dataset, our framework also achieves state-of-the-art performance on Chinese cross-modal retrieval benchmarks.
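The abstract does not give the exact formulation of the TEAM block, so the following is only a minimal sketch of the general idea it describes: aligning each textual token with visual tokens and aggregating token-level scores into an image-text matching score. The function name `token_alignment_scores`, the max-over-visual-tokens alignment, and the masked averaging are all illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn.functional as F


def token_alignment_scores(visual_tokens, text_tokens, text_mask):
    """Hypothetical token-level alignment scoring (not the paper's exact TEAM block).

    visual_tokens: (B, Nv, D) image token embeddings
    text_tokens:   (B, Nt, D) text token embeddings
    text_mask:     (B, Nt) float mask, 1 for real tokens, 0 for padding
    Returns one matching score per image-text pair, shape (B,).
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)

    # Pairwise similarity between every textual and visual token: (B, Nt, Nv).
    sim = torch.einsum("bte,bve->btv", t, v)

    # Align each textual token with its most similar visual token (assumed alignment rule).
    token_scores, _ = sim.max(dim=-1)  # (B, Nt)

    # Aggregate token-level scores over real (non-padded) text tokens.
    token_scores = token_scores * text_mask
    return token_scores.sum(dim=-1) / text_mask.sum(dim=-1).clamp(min=1)
```

Under these assumptions, the per-pair score could be used directly for retrieval ranking or as a fine-grained term alongside a global image-text contrastive loss.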