TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We propose an automatic red-teaming method with a feedback mechanism that effectively discovers safety risks in vision-language models.
Abstract: Vision-Language Models (VLMs) have become a cornerstone of multi-modal artificial intelligence, enabling seamless integration of visual and textual information for tasks such as image captioning, visual question answering, and cross-modal retrieval. Despite their impressive capabilities, these models often exhibit inherent vulnerabilities that can lead to safety failures in critical applications. Red-teaming is an important approach for identifying and testing a system's vulnerabilities, but how to conduct red-teaming for contemporary VLMs remains largely unexplored. In this paper, we propose a novel multi-modal red-teaming approach, TRUST-VLM, to improve both the attack success rate and the diversity of successful test cases for VLMs. Specifically, TRUST-VLM builds on in-context learning to adversarially test a VLM on both image and text inputs. Furthermore, we incorporate feedback from the target VLM to improve the efficiency of test-case generation. Extensive experiments show that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. These findings highlight the importance of advanced red-teaming strategies in ensuring the reliability of VLMs.
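For readers who want a concrete picture of the feedback loop described in the abstract, below is a minimal Python sketch of a feedback-driven multi-modal red-teaming procedure. It is an illustrative assumption, not the authors' actual implementation: the names TestCase, query_vlm, propose, and judge are hypothetical placeholders standing in for the target VLM, the in-context attacker, and the safety judge.

```python
# Minimal sketch (assumption, not the paper's code) of a feedback-driven
# red-teaming loop: an in-context attacker proposes image-text test cases,
# the target VLM responds, a judge scores the response, and the outcome is
# fed back to condition the next proposal.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    image: bytes          # candidate adversarial image (raw bytes placeholder)
    prompt: str           # candidate adversarial text prompt
    response: str = ""    # target VLM's response, filled in after querying
    unsafe: bool = False  # judge verdict: did the response violate policy?


def red_team(
    query_vlm: Callable[[bytes, str], str],        # target VLM under test
    propose: Callable[[List[TestCase]], TestCase],  # in-context attacker conditioned on prior attempts
    judge: Callable[[str], bool],                   # safety classifier over responses
    seeds: List[TestCase],
    budget: int = 100,
) -> List[TestCase]:
    """Iteratively generate image-text test cases, using the target's
    responses as feedback for subsequent generations."""
    history: List[TestCase] = list(seeds)
    successes: List[TestCase] = []
    for _ in range(budget):
        case = propose(history)                     # in-context generation from past attempts
        case.response = query_vlm(case.image, case.prompt)
        case.unsafe = judge(case.response)          # feedback signal from the target's output
        history.append(case)                        # feed the result back for the next round
        if case.unsafe:
            successes.append(case)                  # collect successful (unsafe-eliciting) cases
    return successes
```

In this sketch, the feedback mechanism is simply the growing history of attempts and verdicts passed back to the attacker, which is one straightforward way to realize the loop the abstract describes.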
Lay Summary: Many modern AI systems can look at images, read accompanying text, and then answer questions or describe what they see. While these Vision-Language Models are powerful, they can still be tricked into producing harmful or biased outputs by carefully crafted inputs, raising safety and reliability concerns in real-world applications. In this work, we introduce TRUST-VLM, an automated “red-teaming” method that mimics adversarial attacks by generating and refining realistic image-and-text test cases. By feeding back model responses into the testing loop, TRUST-VLM uncovers a wider range of hidden failures than previous techniques. Our extensive experiments on multiple leading models show that this approach not only finds more vulnerabilities but also suggests concrete ways to make these AI systems safer before they are deployed to users.
Primary Area: Social Aspects->Safety
Keywords: Vision Language Models, Red-teaming, Trustworthy AI
Submission Number: 3207