Scaling Randomized Smoothing to state-of-the-art Vision-Language Models

Published: 06 Mar 2025, Last Modified: 19 Mar 2025 · ICLR 2025 Workshop VerifAI Poster · CC BY 4.0
Keywords: Randomized Smoothing, Vision Language Models, AI Safety
TL;DR: We extend and scale Randomized Smoothing to Vision-Language Models
Abstract: Certifying the robustness of Deep Neural Networks (DNNs) is crucial, especially with the rise of powerful generative models, such as Large Language Models (LLMs) or Vision-Language Models (VLMs), that have the potential to generate dangerous or harmful responses. Recent work has shown that these large-scale models are still susceptible to adversarial attacks, despite their safety fine-tuning. Randomized Smoothing (RS), the current state-of-the-art (SoTA) method for robustness certification, cannot be applied to models such as VLMs: first, RS is designed for classification, not generation. Second, RS is a probabilistic approach, typically requiring $10^5$ samples to certify a single input, making it infeasible for large-scale modern VLMs. This is the challenge we aim to tackle in this work. First, we reformulate RS for the case of generative models, where we distinguish between harmless and harmful responses. Second, we develop a theory that allows us to reduce the number of samples required by 2-3 orders of magnitude with little effect on the certified radius, and we mathematically analyze the radius's dependence on the number of samples. Combined, these advances allow us to scale RS to SoTA VLMs, something that was not feasible before. We demonstrate this experimentally by defending against a recent SoTA attack on aligned VLMs.
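To make the setting concrete, below is a minimal sketch of how a standard Randomized Smoothing certificate (in the style of Cohen et al., 2019) would look when the output is collapsed to a binary harmless/harmful decision, as the abstract describes. This is not the paper's actual algorithm: the `is_harmless` judge, the noise level `sigma`, the sample count `n`, and the confidence level `alpha` are all illustrative assumptions, with `n` chosen far below the classic $10^5$ to reflect the sample-reduction idea.

```python
# Minimal sketch: RS certification over a harmless/harmful split.
# `is_harmless` is a hypothetical stand-in for a VLM plus a safety judge.
import numpy as np
from scipy.stats import beta, norm

def certify_harmless(image: np.ndarray,
                     is_harmless,          # callable: perturbed image -> bool
                     sigma: float = 0.25,  # smoothing noise level (assumption)
                     n: int = 1000,        # illustrative, well below 1e5
                     alpha: float = 0.001) -> float:
    """Return a certified L2 radius within which the smoothed model stays
    harmless, or -1.0 if no certificate can be issued."""
    # Query the model on n Gaussian-perturbed copies of the input.
    count = 0
    for _ in range(n):
        noisy = image + sigma * np.random.randn(*image.shape)
        count += int(is_harmless(noisy))

    # One-sided Clopper-Pearson lower confidence bound on P(harmless).
    p_lower = beta.ppf(alpha, count, n - count + 1) if count > 0 else 0.0

    # Standard RS certified radius: R = sigma * Phi^{-1}(p_lower).
    return sigma * norm.ppf(p_lower) if p_lower > 0.5 else -1.0
```

The dominant cost is the `n` forward passes through the VLM, which is why reducing the required number of samples (while controlling the loss in certified radius) is the key scaling question the paper addresses.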
Submission Number: 31