Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP remains limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address this problem by using a large language model (LLM) to generate large-scale datasets of image captions containing negation, which are then used to fine-tune CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To broaden this scope, we (1) introduce a training-time negation data generation pipeline that produces negation captions during training, adding only 2.5\% extra training time, and (2) propose the first benchmark, $\textit{Neg-TtoI}$, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, $\textit{TNG-CLIP}$, achieves state-of-the-art performance on diverse negation benchmarks spanning image-to-text matching, text-to-image retrieval, and image generation.
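The abstract describes generating negation captions on the fly during training rather than building an offline LLM-generated corpus. As a rough illustration only (the templates, concept pool, and sampling strategy below are assumptions for exposition, not the paper's actual TNG-CLIP pipeline), a batch-level generator might look like:

```python
import random

# Hypothetical on-the-fly negation-caption generator: for each caption in a
# training batch, attach a negated mention of a concept that is absent from
# the caption. Templates and concept list are illustrative assumptions.

NEGATION_TEMPLATES = [
    "{caption}, with no {concept}",
    "{caption}, without any {concept}",
    "{caption}, but there is no {concept}",
]

def make_negation_caption(caption: str, concept_pool: list[str], rng: random.Random) -> str:
    """Build a negation caption by negating a concept not present in the caption."""
    absent = [c for c in concept_pool if c.lower() not in caption.lower()]
    if not absent:  # nothing safe to negate; keep the original caption
        return caption
    concept = rng.choice(absent)
    template = rng.choice(NEGATION_TEMPLATES)
    return template.format(caption=caption.rstrip("."), concept=concept)

if __name__ == "__main__":
    rng = random.Random(0)
    concepts = ["dog", "bicycle", "umbrella", "traffic light"]
    batch = [
        "A man riding a horse on the beach.",
        "Two children playing with a kite in the park.",
    ]
    # Because the negation captions are produced per batch at training time,
    # no offline caption-generation pass is required; both the original and
    # the negated captions can be fed into the contrastive loss.
    for cap in batch:
        print(make_negation_caption(cap, concepts, rng))
```

Generating captions inside the training loop like this is what keeps the added cost small (the paper reports roughly 2.5\% extra training time), since no separate large-scale data-generation stage is needed.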
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, data augmentation, data-efficient training, contrastive learning, multimodal applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2562