Seeing What’s Not There: Negation Understanding Needs More Than Training

ICLR 2026 Conference Submission 8246 Authors

17 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Negation, Zero-shot, Vision-Language Models, Machine Learning, Computer Vision, Deep Learning
Abstract: Understanding negation in a sentence is an important part of compositional understanding and logic in natural language. Many practical AI applications, such as autonomous driving, involve precise instructions that contain negations. For example, following the instruction "locate a parking spot without a vehicle" requires an AI assistant not to confuse the presence and absence of vehicles. Although joint embedding-based Vision Language Models (VLMs) like CLIP have revolutionized multi-modal tasks, they struggle to interpret negation. To address this limitation, many recent works have taken a data-centric approach, introducing additional datasets with hard-negative samples for both image and text data. In contrast to these approaches, we present a zero-shot approach to the negation understanding problem. We probe the properties of CLIP text embeddings and show that they follow compositional arithmetic operations, which allow the addition or removal of semantic information directly in the embedding space. We then present a rule-based approach that extracts the negated text from a given caption and uses it to explicitly remove the corresponding semantic information from the original embedding, improving negation understanding in VLMs. Our approach does not require an expensive training process to induce negation understanding in the model, and it achieves state-of-the-art performance on a popular benchmark for negation understanding. We improve baseline CLIP performance on NegBench from 25.5% to 67.0% on MCQ tasks and from 50.9% to 56.1% on retrieval tasks. Even for the NegCLIP model, which is fine-tuned on negation datasets, our approach boosts MCQ accuracy from 54.03% to 66.22% and retrieval accuracy from 59.25% to 60.1%, showing strong performance.
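To make the embedding-arithmetic idea concrete, here is a minimal sketch of the approach the abstract describes: detect a negated phrase with a simple rule, embed it, and subtract it from the caption embedding. The regex rule, the scaling factor `alpha`, and the Hugging Face checkpoint are illustrative assumptions, not the paper's exact rule set or pipeline.

```python
# Sketch of zero-shot negation handling via CLIP text-embedding arithmetic.
# Assumptions: a toy "without/no/not" regex stands in for the paper's
# rule-based extractor, and alpha is a hypothetical scaling factor.
import re
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    """Return L2-normalized CLIP text embeddings."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def negation_aware_embedding(caption, alpha=1.0):
    # Toy rule: treat the span after "without"/"no"/"not" as the negated concept.
    match = re.search(r"\b(?:without|no|not)\s+(.+)", caption, flags=re.IGNORECASE)
    cap_emb = embed([caption])
    if match is None:
        return cap_emb  # no negation detected; use the caption embedding as-is
    neg_emb = embed([match.group(1)])
    # Explicitly remove the negated concept's direction, then renormalize.
    out = cap_emb - alpha * neg_emb
    return out / out.norm(dim=-1, keepdim=True)

emb = negation_aware_embedding("a parking spot without a vehicle")
print(emb.shape)  # torch.Size([1, 512]) for the ViT-B/32 text tower
```

The resulting embedding can be scored against image embeddings exactly as a standard CLIP text embedding would be, which is what makes the approach training-free.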
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8246