Abstract: Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing $4$ proprietary and $4$ open-access models on $3$ BP datasets featuring synthetic (classic BPs) and real-world (Bongard-HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classic BPs is not due to domain specificity, but rather stems from the models' general AVR limitations. Code and dataset are available at: https://github.com/pavonism/bongard-rwr
Lay Summary: Humans are good at recognizing abstract patterns in images, like those found in IQ tests. But can advanced AI models do the same? Our study investigates whether multimodal large language models, which can process both images and text, can solve Bongard Problems, a type of visual-textual reasoning challenge. We tested several advanced AI models on tasks involving synthetic and real-world images. While these models showed some success with real-world tasks, they struggled with synthetic puzzles, which are more abstract. To probe this issue, we created Bongard-RWR, a new dataset representing abstract concepts with real-world images. Our findings suggest that the AI models' difficulties are not merely due to an unfamiliar synthetic image domain, but stem from fundamental limitations in how they understand visual concepts. This highlights the need to improve the abstract reasoning capabilities of these models, a step toward more human-like reasoning.
Link To Code: https://github.com/pavonism/bongard-rwr
Primary Area: Deep Learning->Foundation Models
Keywords: Multimodal Large Language Models, Abstract Visual Reasoning, Bongard Problems
Submission Number: 6624