Keywords: vision language models, large language models, large multimodal models, vision-language reasoning, benchmarking, visual geometry
TL;DR: We introduce AVSBench, a novel benchmark designed to evaluate 36 atomic visual skills in geometry, and show through experiments on influential models that they struggle with fundamental geometric visual tasks.
Abstract: Recent Vision Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we introduce the Atomic Visual Skills Benchmark (AVSBench) to evaluate whether VLMs can understand basic geometric features, which we refer to as atomic visual skills. Specifically, we systematically categorize the atomic visual skills and handcraft a set of 5,073 diverse questions, each designed to assess an individual atomic visual skill. Using AVSBench, we evaluate current leading VLMs and find that they struggle with most of these atomic visual skills, even though the skills are obvious to humans.
Concurrent Submissions: N/A
Submission Number: 71