TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP

ACL ARR 2026 January Submission 6633 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: image text matching, data augmentation, data-efficient training, contrastive learning, multimodal applications
Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP remains limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods approach the problem by using a large language model (LLM) to generate large-scale image-caption data containing negation for fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. We overcome these limitations by (1) introducing a training-time negation data generation pipeline for CLIP fine-tuning, in which large-scale negation captions are generated efficiently during training at the cost of only 2.8\% extra training time, and (2) proposing Neg-T2I, the first benchmark for evaluating whether text-to-image generation models produce semantically accurate images from prompts containing negation. We show that our proposed method, TNG-CLIP, achieves state-of-the-art performance on diverse negation benchmarks covering image-to-text matching, text-to-image retrieval, and the proposed Neg-T2I task.
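The abstract's central idea is to generate negation captions on the fly inside each training batch rather than building a large augmented dataset offline with an LLM. The paper's actual generation rules are not given here, so the following is only a minimal sketch under assumed details: `make_negation_caption`, the `NEGATION_TEMPLATES` list, and the per-image list of absent concepts are all hypothetical placeholders, paired with a standard symmetric CLIP-style contrastive loss rather than the paper's exact objective.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical templates for turning an ordinary caption into a
# negation caption; the paper's actual generation rules may differ.
NEGATION_TEMPLATES = [
    "{caption}, with no {concept}",
    "{caption}, without any {concept}",
    "{caption}, but there is no {concept}",
]

def make_negation_caption(caption: str, absent_concept: str) -> str:
    """Attach a randomly chosen negation clause naming a concept that
    is absent from the image (illustrative placeholder only)."""
    return random.choice(NEGATION_TEMPLATES).format(
        caption=caption, concept=absent_concept
    )

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Standard symmetric InfoNCE loss over a batch of paired
    image and text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Inside the training loop: augment captions per batch, then encode
# images and (augmented) captions with CLIP as usual. The encoders
# are omitted; random embeddings stand in for their outputs.
captions = ["a dog running on a beach", "a bowl of red apples"]
absent = ["frisbee", "bananas"]  # assumed known-absent concepts
neg_captions = [make_negation_caption(c, a)
                for c, a in zip(captions, absent)]
image_emb = torch.randn(2, 512)  # stand-in for CLIP image encoder
text_emb = torch.randn(2, 512)   # stand-in for CLIP text encoder
loss = clip_contrastive_loss(image_emb, text_emb)
```

Generating negation captions per batch in this way avoids storing a large precomputed augmented dataset and adds only a small constant cost to each step, which is consistent with the abstract's claim of 2.8\% extra training time for the real pipeline.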
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, data augmentation, data-efficient training, contrastive learning, multimodal applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 6633