Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A diagnostic evaluation of VLMs on Bongard problems that reveals shortcomings in both perception and reasoning.
Abstract: Recently developed Vision-Language Models (VLMs), such as OpenAI's o1, seemingly demonstrate advanced reasoning capabilities across text and image modalities. However, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess this progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classic visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, even when explicitly asked to recognize ground-truth concepts, the models continue to fail, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition.
Lay Summary: New AI systems called Vision-Language Models (VLMs), like OpenAI’s o1, are designed to understand and reason about both pictures and text. These models are impressive on the surface, but it's still unclear how well they truly "understand" what they see. To test their abilities, we used a set of tricky visual puzzles known as Bongard problems. These puzzles challenge the kind of pattern recognition and abstract thinking that humans are naturally good at. Our tests revealed that while these AI models can sometimes spot key patterns and solve a few puzzles, they often struggle, especially with concepts that seem simple to people, like recognizing a spiral. Even when we clearly told the models what they were supposed to look for, they still had trouble. We also compared their performance to how humans do on the same tasks. The results showed a big gap: human thinking and visual understanding are still far ahead of what these AI systems can do.
Link To Code: https://github.com/ml-research/bongard-in-wonderland
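
For intuition, here is a minimal sketch of how one might pose a Bongard problem to a VLM through an OpenAI-style chat API. This is not the authors' evaluation code (see the repository above for that); the model name, image path, and prompt wording are illustrative assumptions.

```python
import base64

from openai import OpenAI  # pip install openai

# Assumption: a local composite image of one Bongard problem,
# with six "left" panels and six "right" panels.
IMAGE_PATH = "bp_001.png"

PROMPT = (
    "This is a Bongard problem. The six panels on the left share a "
    "visual concept that the six panels on the right do not. "
    "State the rule that separates the two sides."
)

def encode_image(path: str) -> str:
    """Base64-encode an image for the chat API's image_url payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{encode_image(IMAGE_PATH)}"
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```

A full diagnostic evaluation along the lines of the paper would then score the model's free-form rule against the ground-truth concept, and separately probe whether the model can at least recognize that concept when it is stated explicitly.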
Primary Area: Deep Learning->Foundation Models
Keywords: Visual Reasoning, Vision Language Models, Bongard problems
Submission Number: 15802