Keywords: Multimodal, robustness, distribution shift
TL;DR: We investigate and benchmark the robustness of multimodal image-text models under image and text perturbations.
Abstract: Multimodal image-text models have shown remarkable performance in the past few years. Because these foundation models are used in downstream applications, their robustness against distribution shifts is crucial. In this paper, we investigate their robustness under image and text perturbations. We first build several multimodal benchmark datasets by applying 17 image perturbation and 16 text perturbation techniques. We then extensively study the robustness of 6 widely adopted models on 3 downstream tasks (image-text retrieval, visual reasoning, and visual entailment). We observe that these powerful multimodal models are sensitive to both image and text perturbations, especially image perturbations. For text, character-level perturbations have a higher adversarial impact than word-level and sentence-level perturbations. We also observe that models trained with generative objectives tend to be more robust. Our findings could facilitate the development of large image-text models, as well as their deployment in real-world applications.
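
To make the distinction between character-level and word-level text perturbations concrete, here is a minimal sketch in Python. The abstract does not list the 16 text perturbation techniques, so the two shown here (adjacent-character swap and random word deletion) are illustrative assumptions, and the function names `swap_adjacent_chars` and `delete_random_word` are hypothetical rather than taken from the benchmark.

```python
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Character-level perturbation (assumed example): swap one random
    pair of adjacent characters, leaving the length unchanged."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_random_word(text: str, rng: random.Random) -> str:
    """Word-level perturbation (assumed example): drop one randomly
    chosen whitespace-separated word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    caption = "a dog runs across the grass"
    print(swap_adjacent_chars(caption, rng))
    print(delete_random_word(caption, rng))
```

A perturbed benchmark split could then be built by mapping such a function over every caption in a dataset while keeping the paired images fixed, so that any change in task accuracy is attributable to the text shift alone.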