Fooling Contrastive Language-Image Pre-Trained Models with CLIPMasterPrints

Published: 19 Apr 2024, Last Modified: 19 Apr 2024, Accepted by TMLR
Abstract: Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are the backbone of many recent advances in artificial intelligence. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts, while appearing to humans either unrecognizable or unrelated to the attacked prompts. We demonstrate how fooling master images can be mined using stochastic gradient descent, projected gradient descent, or gradient-free optimisation. Contrary to many common adversarial attacks, the gradient-free optimisation approach allows us to mine fooling examples even when the weights of the model are not accessible. We investigate the properties of the mined fooling master images, and find that images mined for a small number of image captions potentially generalize to a much larger number of semantically related captions. Finally, we evaluate possible mitigation strategies and find that vulnerability to fooling master images appears to be closely related to a modality gap in contrastively pre-trained multi-modal networks.
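The black-box setting mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption rather than the authors' method: the CLIP model is replaced by a hypothetical toy linear image encoder with random text embeddings, the objective is assumed to be the mean cosine similarity between the image embedding and all target prompt embeddings, and a simple (1+1) evolution strategy with clipping to the pixel range stands in for whatever gradient-free optimiser the paper uses. The point is only that the attacker queries scores and never touches the encoder's weights.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def make_toy_model(img_dim=64, emb_dim=16, n_prompts=5, seed=0):
    """Hypothetical stand-in for CLIP: a random linear image encoder W and
    fixed random text embeddings, one per target prompt."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(emb_dim, img_dim))
    text_embs = rng.normal(size=(n_prompts, emb_dim))
    return W, text_embs


def score(x, W, text_embs):
    """Assumed objective: mean cosine similarity between the image embedding
    and every target text embedding. The optimiser below only calls this
    function; it never reads W or computes gradients."""
    z = W @ x
    return float(np.mean([cosine(z, t) for t in text_embs]))


def mine_black_box(W, text_embs, img_dim=64, iters=2000, sigma=0.05, seed=1):
    """(1+1) evolution strategy: perturb the current image with Gaussian
    noise, clip it back to the valid pixel range [0, 1], and keep the
    candidate only if the queried score improves."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=img_dim)
    best = score(x, W, text_embs)
    initial = best
    for _ in range(iters):
        cand = np.clip(x + sigma * rng.normal(size=img_dim), 0.0, 1.0)
        s = score(cand, W, text_embs)
        if s > best:
            x, best = cand, s
    return x, initial, best


W, text_embs = make_toy_model()
x, initial, best = mine_black_box(W, text_embs)
print(f"mean cosine similarity over prompts: {initial:.3f} -> {best:.3f}")
```

Because a candidate is accepted only when it improves the score, the mean similarity over all prompts never decreases, and clipping keeps the image a valid input at every step. A real attack would swap `score` for queries against an actual CLIP model and would typically use a stronger optimiser.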
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to thank all reviewers for their insightful comments, which we believe have improved our paper. The most important new additions are: (1) an experiment using WordNet hyponyms to confirm that CLIPMasterPrints in fact generalize to semantically related classes; (2) an additional appendix section further exploring the behavior of CLIPMasterPrints when targeting an increasing number of classes/prompts; (3) a rewrite of Section 3 to make it accessible to a wider audience of readers, in which the connection to the modality gap, and how CLIPMasterPrints exploit it, is drawn earlier and made more explicit; (4) an elaboration in Section 5 of our view on the visual artifacts that CLIPMasterPrints exhibit, and how these artifacts affect the envisioned risk for real-world systems; (5) revised formulations wherever the reviewers found them too vague. Furthermore, a number of stylistic and typographic issues pointed out by the reviewers have been fixed. We hope that the reviewers find these changes to be in line with their remarks, to which we respond in detail in a point-by-point fashion below.
Revision 2: As suggested, an experiment on automatically detecting CLIPMasterPrints by means of a classifier has been added.
Revision 3: Camera-ready revision: added author names/affiliations and acknowledgements, minor aesthetic improvements (removed shortcuts in section titles, fixed the plot overlapping the legend in Figure 5), and updated the supplementary material to include the code of the additional experiments.
Supplementary Material: zip
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 1950