Abstract: Video-text retrieval has widespread applications in economic and security domains, making it crucial to evaluate its robustness through adversarial attacks. However, existing research in this area is inadequate. In this paper, we are the first to introduce adversarial attacks to this task. Leveraging the concept of metric learning, we propose two novel attack methods: Cross-modal Dual Level Contrastive Attack (CDCA) and Cross-modal Rank Pairing Attack (CRPA). In the white-box scenario, CDCA uses the distribution of head and tail examples in the retrieval list to form positive and negative example sets, employing both coarse-grained and fine-grained features. In the black-box scenario, CRPA forms example pairs from rank differences in the retrieval list and uses the Rank Difference Loss (RDL) as the attack objective function. Experiments validate the superiority of our methods. Furthermore, we contribute a benchmark, which lays a foundation for understanding the vulnerability of multi-modal models.