Track: long paper (up to 10 pages)
Keywords: Vision Language Model, Negation Understanding, Affirmative Bias, Described Object Detection, Chain-of-Thought Reasoning, Token Merging
TL;DR: Improves vision-language models' negation understanding in described object detection via a dataset built with chain-of-thought reasoning (CoVAND) and a negation-aware token merging module (NegToMe).
Abstract: Despite their remarkable capabilities in natural language understanding, Vision-Language Models (VLMs) exhibit critical bottlenecks in fundamental logical reasoning, particularly in processing the logical operator of negation. This deficiency frequently results in self-contradictory predictions in which models fail to differentiate between a concept and its negation ($A$ vs.\ $\neg A$), a phenomenon often observed as ``affirmative bias'' in visual contexts. In this work, we leverage Described Object Detection (DOD) as a rigorous testbed for evaluating and resolving these logical inconsistencies, and we make two primary contributions. First, we introduce CoVAND, a dataset constructed via a deductive chain-of-thought (CoT) reasoning pipeline that synthesizes consistent, instance-grounded logical propositions. Second, we present NegToMe, a novel token merging module that acts as a symbolic representation mechanism. NegToMe directly mitigates the structural loss of logical operators caused by standard tokenization: by explicitly binding negation cues to their target operands (e.g., merging ``not'' and ``girl'' into a single, structurally coherent $\neg\text{girl}$ token), it preserves strict logical polarity at the input representation level. Evaluated on rigorous consistency benchmarks, our lightweight adaptation approach significantly reduces self-contradictory false positives and improves NMS-AP by up to +10.8 points on OVDEval. This work demonstrates an effective framework for embedding symbolic logical operations into VLMs, paving the way for more reliable deductive reasoning in multimodal applications.
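For intuition, a minimal string-level Python sketch of the negation-binding idea described in the abstract; the cue set and function name are hypothetical, and the paper's NegToMe operates on learned token embeddings within the model, not on raw word strings.

    # Minimal sketch, assuming a whitespace tokenizer and an illustrative cue list.
    NEGATION_CUES = {"not", "no", "without"}  # assumed negation cue inventory

    def merge_negation_tokens(tokens: list[str]) -> list[str]:
        """Bind each negation cue to the token that follows it as one ¬-token."""
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
                merged.append("¬" + tokens[i + 1])  # e.g. "not girl" -> "¬girl"
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    print(merge_negation_tokens("a girl not wearing a hat".split()))
    # ['a', 'girl', '¬wearing', 'a', 'hat']

The point of the merge is that the negation operator and its operand survive tokenization as a single unit with fixed polarity, rather than as two independent tokens the model can attend to separately.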
Presenter: ~Inha_Kang1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 69