Keywords: geometry, large multimodal models, mathematical reasoning
TL;DR: We introduce TurtleBench, a benchmark on geometric reasoning and code generation for LMMs, and find that state-of-the-art models perform poorly on it.
Abstract: While formal geometric reasoning may be difficult for humans without extensive training, humans seem able to reason intuitively about geometric patterns in images and scenes from a young age. In contrast, developing large multimodal models (LMMs) capable of similar feats remains a frontier in AI research. We introduce TurtleBench, a benchmark designed to evaluate LMMs' capacity to interpret geometric patterns (given visual examples, textual instructions, or both) and generate precise code outputs. Inspired by turtle geometry, an approach used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Unlike object detection tasks that typically do not involve understanding underlying patterns, this benchmark combines geometric reasoning with image understanding. Our evaluation reveals that leading LMMs struggle significantly with these tasks, with GPT-4V achieving only 19% accuracy on the simplest tasks. TurtleBench highlights the gap between human and AI performance in intuitive, visual geometric understanding, setting the stage for future research in this area.
Concurrent Submissions: NeurIPS 2024 Workshop MAR (Multimodal Algorithmic Reasoning)
Submission Number: 48