Keywords: Multimodal Large Language Models, Abstract Visual Reasoning, Bongard Problems
Abstract: Abstract visual reasoning (AVR) encompasses a suite of tasks whose solution requires the ability to discover common concepts underlying a set of pictures through an analogy-making process, similar to solving human IQ test problems. Bongard Problems (BPs), proposed in 1968, constitute one of the fundamental challenges in this domain. Despite multiple advances in artificial intelligence, BP tasks remain unsolved, mainly because they require combining visual reasoning with verbal description. In this work, we pose the question of whether multimodal large language models (MLLMs), inherently designed to combine vision and language, are capable of tackling BPs. To this end, we propose a set of diverse MLLM-suited strategies for tackling BPs and test 4 popular proprietary MLLMs: GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3.5 Sonnet, as well as 4 publicly available open models: InternVL2-8B, LLaVa-1.6 Mistral-7B, Phi-3.5-Vision, and Pixtral 12B. The above MLLMs are compared on 3 BP datasets from the AVR literature: a set of original BP instances relying on synthetic, geometry-based images and two recent datasets based on real-world images, i.e., Bongard-HOI and Bongard-OpenWorld. Our experiments reveal significant limitations of current MLLMs in solving BPs. In particular, the models struggle to solve the classical set of synthetic BPs representing abstract concepts, despite their visual simplicity. Although their performance improves for the real-world concepts expressed in the Bongard-HOI and Bongard-OpenWorld datasets, the models still have difficulty utilizing new information to improve their predictions and using the dialog context window effectively.
To better capture the reasons for this performance discrepancy between the synthetic and real-world AVR domains, we propose Bongard-RWR, a new BP dataset composed of specifically designed real-world images that translate concepts from hand-crafted synthetic matrices to the real world, and perform focused experiments with this new dataset. The results suggest that the models' weak performance on classical BPs stems not from domain specificity, but rather from their general AVR limitations.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6466