Keywords: vision language models, limitation analysis, bias, reasoning, benchmark, counting
TL;DR: VLMs struggle with simple counting tasks under strong perceptual bias.
Abstract: Large language models (LLMs) memorize vast amounts of prior knowledge from the Internet that helps them on downstream tasks, but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on the standard, objective visual task of counting, a common mathematical skill in everyday life. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains: animals, logos, chess, board games, optical illusions, and patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or to rely exclusively on image details improves counting accuracy by only +2 points, on average. Our findings reveal critical limitations of VLM capabilities in visual counting, posing an important question of how to perform math under strong perceptual bias.
Submission Number: 70