Keywords: red-teaming, NSFW classification, generative AI, robustness, adversarial attacks, LLM, T2I
TL;DR: We present a systematic framework to automatically red-team NSFW image classifiers (often deployed as safeguards in text-to-image systems) against image context shifts.
Abstract: Not Safe for Work (NSFW) image classifiers play a critical role in safeguarding text-to-image (T2I) systems. However, a concerning phenomenon has emerged in T2I systems: changes in text prompts that manipulate benign image elements can cause these classifiers to fail to detect NSFW content, which we dub "*context shifts*." For instance, while an NSFW image of *a nude person in an empty scene* is easily blocked by most NSFW classifiers, a stealthier one that depicts *a nude person blending into a group of dressed people* may evade detection. How can we systematically reveal NSFW image classifiers' failures under context shifts?
Towards this end, we present an automated red-teaming framework that leverages a set of generative AI tools. We propose an **exploration-exploitation** approach. First, in the *exploration* stage, we synthesize a large and diverse dataset of 36K NSFW images that facilitates our study of context shifts. We find that varying fractions of this dataset (e.g., $4.1$% to $36$% of nude and sexual content) are misclassified by NSFW image classifiers such as GPT-4o and Gemini. Second, in the *exploitation* stage, we leverage these failure cases to train a specialized LLM that rewrites unseen seed prompts into more evasive versions, increasing the likelihood of detection evasion by up to 6 times. Alarmingly, we show that **these failures translate to real-world T2I and text-to-video (T2V) systems**, including DALL-E 3, Sora, Gemini, and Grok, beyond the open-weight image generators used in our red-teaming pipeline. For example, querying DALL-E 3 and Imagen 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from $0$% to over $44$%.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6299