Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
Abstract: Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists in "conceptualization": the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset of six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed outright on isomorphism detection and showed only limited success on path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations of current AI models in visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/
Lay Summary: Imagine you are shown two triangles: one drawn with thick black lines, another made of colorful dots. You instantly recognize both as triangles. This ability to see past surface appearances and grasp underlying concepts is called "conceptualization," and it's fundamental to human intelligence. While AI systems have become remarkably good at answering questions about images, they struggle with this basic skill. We wondered: can AI truly understand concepts, or does it just memorize visual patterns? To find out, we created the Visual Graph Arena (VGA), a test featuring six challenges involving networks of connected points. We presented these graphs in different visual styles to both humans and cutting-edge AI systems, testing whether they could reason about structure regardless of appearance. The results were striking. Humans achieved near-perfect scores across all tasks, while AI models failed dramatically. They couldn't even recognize when two graphs had identical structures but different visual presentations, and showed only limited success at basic tasks like finding paths. Even worse, we found evidence of superficial pattern matching rather than genuine understanding. Our findings reveal critical gaps in current AI vision systems and provide a framework for building more human-like artificial intelligence.
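The isomorphism task described above hinges on having a ground-truth oracle: two drawings are "the same concept" exactly when their graphs are isomorphic. The paper does not publish its labeling code, but for the small graphs involved, a brute-force sketch like the following (a hypothetical illustration, not the authors' implementation) shows how such labels can be generated: try every node permutation and test whether it maps one edge set onto the other.

```python
from itertools import permutations

def is_isomorphic(edges_a, edges_b, n):
    """Brute-force isomorphism test for two undirected n-node graphs,
    each given as an edge list. Feasible only for small n (O(n!))."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if len(a) != len(b):          # different edge counts: cannot match
        return False
    for perm in permutations(range(n)):
        # Relabel every edge of graph A under this candidate mapping.
        if {frozenset((perm[u], perm[v])) for u, v in a} == b:
            return True
    return False

# A 5-cycle and a relabeled copy: same structure, different presentation,
# mirroring the paper's test of reasoning past visual form.
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
relabeled = [((u * 2) % 5, (v * 2) % 5) for u, v in c5]
star = [(0, 1), (0, 2), (0, 3), (0, 4)]  # genuinely different structure

print(is_isomorphic(c5, relabeled, 5))  # True
print(is_isomorphic(c5, star, 5))       # False
```

In practice a library routine such as NetworkX's `is_isomorphic` (VF2 algorithm) would replace this factorial-time search, and layout functions like `kamada_kawai_layout` and `planar_layout` would produce the visually distinct renderings of the same graph that the benchmark relies on.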
Link To Code: https://vga.csail.mit.edu/
Primary Area: Applications->Computer Vision
Keywords: Graph, Visual Reasoning, Reasoning, LLM, Multimodal LLM
Submission Number: 12671