Disparities in Negation Understanding Across Languages in Vision-Language Models

Published: 02 Mar 2026, Last Modified: 14 Apr 2026
Venue: AFAA 2026 Poster
License: CC BY 4.0
Track: Tiny/Short Papers Track (up to 3 pages)
Keywords: multilingual, vision-language models, negation, fairness, benchmark
TL;DR: Our multilingual benchmark reveals that VLM negation understanding is unequal across languages, raising fairness concerns for global deployment.
Abstract: Vision-language models (VLMs) exhibit \emph{affirmation bias}: a systematic tendency to select positive captions (``X is present'') even when the correct description contains negation (``no X''). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs (CLIP, SigLIP, and MultiCLIP), we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it yields substantial improvements for several languages (particularly English, Greek, Spanish, and Tagalog) but varies in effectiveness across typologically different languages. This variation reveals that linguistic properties such as morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
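For context, below is a minimal sketch of the binary caption-selection protocol the abstract describes: given an image, a contrastive VLM scores an affirmative caption against its negated counterpart, and affirmation bias shows up as the affirmative caption winning even when the negated one is correct. This uses the public openai/clip-vit-base-patch32 checkpoint via HuggingFace transformers; the checkpoint, image path, and caption pair are illustrative assumptions, not the paper's actual benchmark code or data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper evaluates CLIP, SigLIP, and MultiCLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical test item: an image that contains no dog, so the
# negated caption (index 1) is the correct answer.
image = Image.open("example.jpg")
captions = ["a photo of a dog", "a photo with no dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# two captions gives the model's preference in this forced-choice setup.
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = probs.argmax(dim=-1).item()

# With chance at 50% for a binary choice, affirmation bias appears as the
# affirmative caption (index 0) being selected even for negative images.
print(f"probabilities: {probs.tolist()}, chose caption {prediction}")
```

For a multilingual benchmark, the same forced choice would be repeated with caption pairs translated into each target language, and per-language accuracy compared against the 50% chance baseline.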
Submission Number: 49