Keywords: Multimodal Models, Program Synthesis, Visual Programming, Benchmark, Turtle Graphics, Visual Reasoning
TL;DR: A Multimodal Benchmark for Visual Programming and Reasoning
Abstract: Multimodal vision-language models (VLMs) have achieved remarkable success in fundamental visual tasks like image captioning and visual question answering. However, their performance on complex visual tasks requiring integrated visual reasoning and problem-solving capabilities remains underexplored. To bridge this gap, we introduce TurtleAI, a multimodal benchmark for evaluating VLMs on visual programming and reasoning tasks in the Turtle Graphics domain. Our benchmark contains 823 visual programming tasks that challenge VLMs to generate Python code to replicate patterns shown in images. Evaluation of 20 VLMs reveals that state-of-the-art models like GPT-4o and Qwen2-VL-72B struggle with these tasks, achieving success rates of only 26.5% and 11.8%, respectively. Our analysis shows that models often fail to align their code implementations with their visual reasoning. To address this misalignment, we propose TurtleAI-Datagen, a data generation framework that creates large-scale synthetic datasets of task-code pairs. Starting from just 10 seed samples, TurtleAI-Datagen generates over 700k samples. Fine-tuning on this dataset significantly reduces errors arising from the misalignment between visual reasoning and program synthesis, improving Qwen2-VL-72B's performance by over 20%. We will release the benchmark publicly to facilitate future research.
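To make the task format concrete, below is a minimal sketch of the kind of program a VLM might be asked to produce: given an image of a target pattern, emit Python Turtle Graphics code that redraws it. The specific pattern (a square spiral) and the function name are illustrative assumptions, not drawn from the TurtleAI benchmark itself.

```python
# Hypothetical example of a TurtleAI-style solution: Python turtle code
# that reproduces a pattern seen in an input image. The pattern here is
# an assumed square spiral, used only for illustration.
import turtle

def draw_square_spiral(steps: int = 20, start: int = 10, increment: int = 10) -> None:
    """Draw a square spiral by turning 90 degrees and lengthening each segment."""
    t = turtle.Turtle()
    t.speed(0)
    length = start
    for _ in range(steps):
        t.forward(length)   # draw the current segment
        t.left(90)          # turn to form the square corners
        length += increment # grow the segment to create the spiral

if __name__ == "__main__":
    draw_square_spiral()
    turtle.done()  # keep the window open until closed by the user
```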
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16956