Keywords: CAPTCHA, Vision–language models (VLMs), Visual perception, Commonsense bias / confirmation bias, Human–AI performance gap
TL;DR: Humans still beat SOTA VLMs on trivial low-level perception tasks (size comparison, combined distractors, limb counting), enabling practical AI-targeted CAPTCHAs.
Abstract: As generative artificial intelligence advances, traditional CAPTCHAs designed to distinguish humans from machines are losing efficacy: state-of-the-art deep learning systems now solve conventional challenges such as distorted-text recognition with near-perfect accuracy. Motivated by this, we explore “AI-targeted CAPTCHAs”, that is, challenge tasks that humans pass easily but multimodal models find difficult. Building on a review of prior work, we posit two cognitive pathways for image understanding: a “linguistic path” and a “perceptual path.” Guided by this framing, we design five simple, highly intuitive visual question-answering tasks to systematically compare humans with leading multimodal models. Each task pairs a single image with a single question; together the tasks cover color discrimination, size comparison, combined distractors, counting legs on birds, and counting fingers on human hands. We evaluate five mainstream multimodal systems under the same conditions as human participants and test two main hypotheses plus one sub-hypothesis. Results show: (1) on single-feature tasks such as color discrimination, top models approach human-level performance, whereas on the size-comparison and combined-distractor tasks, which require low-level visual perception, model accuracy collapses to near zero while humans perform almost perfectly; (2) on the bird-leg and hand-finger counting tasks, models frequently default to stereotyped prior knowledge and achieve ~20% accuracy or lower, whereas humans rely on the image and score near 100%; (3) models recognize abnormal human fingers slightly better than abnormal bird legs, supporting sub-hypothesis H1.1 that models handle familiar human limbs better than the limbs of less typical species. These differences confirm that current vision-language models primarily follow a linguistic path for image understanding and lack a human-like low-level perceptual path for processing obvious visual information.
The findings quantify the limits of current multimodal systems and demonstrate the feasibility of constructing new CAPTCHAs from such “AI‑hard” tasks.
Supplementary Material: zip
Submission Number: 134