BackBench: Are Vision Language Models Resilient to Object-to-Background Context?

23 Sept 2023 (modified: 07 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Robustness, Real Image Editing, Foundational models, Adversarial Examples, Counterfactual images
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this paper, we evaluate the resilience of modern vision and multimodal foundational models against object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets to induce changes in object characteristics (viewpoint, scale, color) or apply image transformation techniques (adversarial perturbations, common corruptions) to real images to simulate distribution shifts. Our approach, in contrast, edits the background of real images using text prompts, enabling diverse background changes while preserving the original appearance and semantics of the object of interest. This allows us to quantify the role of background context in the robustness and generalization of deep neural networks. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. Using textual guidance for control, we produce various versions of standard vision datasets (ImageNet, COCO), either incorporating diverse and realistic backgrounds into the images or introducing variations in the color and texture of the background. Additionally, we craft adversarial backgrounds by optimizing the latent variables and text embeddings within text-to-image models. We conduct thorough experiments and provide an in-depth analysis of the robustness of vision and language models against object-to-background context variations across different tasks. Our code, evaluation benchmark, and datasets will be publicly released.
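The abstract does not name the specific models used in the pipeline. As a minimal sketch of the object-preserving background editing it describes, assuming a Stable Diffusion inpainting model from the diffusers library and a precomputed object mask (e.g., from an off-the-shelf segmentation model), the text-prompted background change could look like this:

```python
# Hedged sketch: text-guided background editing that preserves the object.
# Assumptions (not stated in the abstract): the object mask is precomputed by
# a segmentation model, and Stable Diffusion inpainting stands in for the
# paper's text-to-image model; file names and the prompt are illustrative.
import PIL.Image
import PIL.ImageOps
import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = PIL.Image.open("dog.png").convert("RGB").resize((512, 512))
object_mask = PIL.Image.open("dog_mask.png").convert("L").resize((512, 512))

# Inpainting repaints the white region of the mask, so invert the object
# mask to keep the object fixed and regenerate only the background.
background_mask = PIL.ImageOps.invert(object_mask)

edited = pipe(
    prompt="a dog standing in a snowy forest at dusk",
    image=image,
    mask_image=background_mask,
).images[0]
edited.save("dog_snowy_background.png")
```

The adversarial variant described in the abstract would additionally optimize the inpainting pipeline's latent variables and text embeddings against a target classifier rather than sampling them directly; that step is omitted here.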
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7542