Keywords: Negation, Zero-shot, Vision-Language Models, Machine Learning, Computer Vision, Deep Learning
Abstract: Understanding negation in a sentence is an important part of compositional
understanding and logic in natural language. Many practical AI applications, such
as autonomous driving, involve precise instructions that contain negations. For example,
the instruction "locate a parking spot without a vehicle" requires an AI assistant
not to confuse the presence and absence of vehicles. Although
joint embedding-based Vision-Language Models (VLMs) like CLIP have
revolutionized multi-modal tasks, they struggle to interpret negation. To address
this limitation, many recent works take a data-centric approach, introducing
additional datasets with hard-negative samples for
both image and text data. In contrast to these approaches, we present a zero-shot
approach to tackle the negation understanding problem. We probe the properties
of CLIP text embeddings and show that they follow compositional arithmetic op-
erations, which allow the addition or removal of semantic information directly in
the embedding space. We then present a rule-based method to extract the negated
text from a given caption and use it to explicitly remove the corresponding
semantic information from the original embedding, improving negation understanding
in VLMs. Our approach requires no expensive training process to induce
negation understanding in the model, and it achieves state-of-the-art
performance on a popular benchmark for negation understanding. We improve the baseline
CLIP model's performance on NegBench from 25.5% to 67.0% on MCQ and from
50.9% to 56.1% on retrieval tasks. Even for NegCLIP, a model fine-tuned on
negation datasets, our approach boosts MCQ accuracy from 54.03% to 66.22%
and retrieval accuracy from 59.25% to 60.1%, demonstrating strong performance.
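The idea described above can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's implementation: `extract_negated` is a hypothetical single-pattern rule (the paper's rule-based extractor is not specified here), the embedding vectors are stand-in values rather than real CLIP text features, and the subtraction weight `alpha` is an assumed hyperparameter.

```python
import re
import numpy as np

def extract_negated(caption):
    """Toy rule-based extraction of a negated concept ("without X" / "no X").
    A real extractor would cover a much richer set of negation patterns."""
    m = re.search(r"\b(?:without|no)\s+(?:a|an|the)?\s*(\w+)", caption)
    return m.group(1) if m else None

def normalize(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(np.dot(normalize(a), normalize(b)))

def remove_concept(caption_emb, concept_emb, alpha=1.0):
    """Compositional arithmetic in embedding space: subtract the negated
    concept's semantics, then renormalize back onto the unit sphere."""
    return normalize(normalize(caption_emb) - alpha * normalize(concept_emb))

# Stand-in vectors for CLIP text features (hypothetical values; a real
# pipeline would obtain these from a CLIP text encoder).
emb_caption = np.array([0.8, 0.5, 0.3])   # "a parking spot without a vehicle"
emb_concept = np.array([0.1, 0.9, 0.2])   # "a vehicle"

# After subtraction, the adjusted embedding is less similar to the
# negated concept than the original caption embedding was.
adjusted = remove_concept(emb_caption, emb_concept)
```

In a real pipeline, the adjusted embedding would replace the original caption embedding when scoring images, so that captions like "a street with no cars" stop matching images that prominently contain cars.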
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8246