OPTiCAL: An Abstract Positional Reasoning Benchmark for Vision Language Models

Published: 24 Sept 2025, Last Modified: 24 Sept 2025, NeurIPS 2025 LLM Evaluation Workshop Poster, CC BY 4.0
Keywords: Vision Language Model, VLM, Multimodal LLM, MLLM, VQA, LLM reasoning, compositional reasoning, object hallucination, object relationship hallucination, CLIP
TL;DR: Six open-source encoder-decoder VLMs are evaluated on an abstract benchmark consisting of shapes on a plain background.
Abstract: Visual question answering (VQA) tasks increasingly employ Vision Language Models (VLMs), but the performance of these models degrades substantially on out-of-distribution or compositional reasoning tasks. This is especially concerning given the wide availability of pretrained VLMs, which can lead to misuse of, and overdependence on, the reasoning capabilities of these models. In this work, we analyze the root causes of poor VLM performance by isolating and testing basic visual reasoning skills, specifically positional understanding, using a novel benchmarking dataset, Shapes30k, generated by our tool, ShapeMaker. Our primary metric is VLM accuracy on the positional reasoning task, and we perform significance testing to detect directional bias in the results. Pretrained VLMs sometimes score below chance (20%) on our benchmark, and we detect varied and significant (p < 0.01) directional biases in each model. Our code is available here: https://anonymous.4open.science/r/optical-benchmark-DAE9/.
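The abstract does not specify the statistical tests used, so the following is only a minimal sketch of how accuracy-vs-chance and directional-bias significance could be checked for a 5-way positional task using SciPy. The relation labels and all counts are hypothetical placeholders, not results or code from the paper.

```python
# Minimal sketch (not the authors' pipeline): test whether a model's accuracy
# differs from chance and whether its predictions show a directional bias.
from scipy.stats import binomtest, chisquare

CHANCE = 0.20  # 5 positional options -> 20% chance accuracy

# Hypothetical tally of one model's predictions over the benchmark.
n_correct, n_total = 150, 1000

# Two-sided exact binomial test: is accuracy distinguishable from chance?
acc_test = binomtest(n_correct, n_total, p=CHANCE)
print(f"accuracy={n_correct / n_total:.3f}, p={acc_test.pvalue:.4f}")

# Directional bias: compare the distribution of predicted relations against
# the uniform distribution expected if the model had no preferred direction.
directions = ["left", "right", "above", "below", "overlapping"]  # assumed label set
pred_counts = [310, 180, 190, 160, 160]                          # hypothetical counts
expected = [sum(pred_counts) / len(directions)] * len(directions)
chi2, p_bias = chisquare(pred_counts, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p_bias:.4f}  (p < 0.01 => significant directional bias)")
```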
Submission Number: 116