Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection

Published: 25 Sept 2024, Last Modified: 14 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Vision-Language models, Multimodal models, CLIP, unwanted visual data detection, text-only training, hateful image detection
TL;DR: We introduce Hassle-Free Textual Training (HFTT), a method designed to improve the performance of VLMs in identifying unwanted visual data, utilizing solely synthetic textual data.
Abstract: In our study, we explore methods for detecting unwanted content lurking in visual datasets. We provide a theoretical analysis demonstrating that a model capable of successfully partitioning visual data can be obtained using only textual data. Based on the analysis, we propose Hassle-Free Textual Training (HFTT), a streamlined method capable of acquiring detectors for unwanted visual content, using only textual data in conjunction with pre-trained vision-language models. HFTT features an innovative objective function that significantly reduces the necessity for human involvement in data annotation. Furthermore, HFTT employs a clever textual data synthesis method, effectively emulating the integration of unknown visual data distribution into the training process at no extra cost. The unique characteristics of HFTT extend its utility beyond traditional out-of-distribution detection, making it applicable to tasks that address more abstract concepts. We complement our analyses with experiments in hateful image detection and out-of-distribution detection. Our codes are available at https://github.com/HFTT-anonymous/HFTT.
Primary Area: Other (please use sparingly, only use the keyword field for more details)
Submission Number: 3444
Loading