Keywords: Benchmarks, Datasets, Vision Language Model Benchmarks, Multimodal Model Benchmarks
TL;DR: We present a novel benchmark for vision language model reasoning whose questions humans answer reliably but the best VLMs struggle with.
Abstract: Maps are central to how humans make sense of the world, from navigation and environmental monitoring to military planning and historical interpretation. Yet despite rapid progress in large multimodal models (LMMs), these systems continue to struggle with interpreting maps, an essential visual-reasoning skill that goes beyond pattern recognition and text extraction. To close this gap, we introduce MapQA, the first large-scale benchmark specifically designed to evaluate LMMs on map understanding. MapQA contains over 4,200 carefully curated, open-ended question–answer pairs spanning diverse map types, each constructed to require reasoning directly from the map rather than relying on memorized world knowledge. Questions are generated through a scalable human-in-the-loop process to ensure quality, and model answers are scored with an LLM-as-a-judge protocol that aligns with human judgments. Our experiments show that while humans answer over 91% of the questions correctly, state-of-the-art proprietary models achieve barely half that performance, and open-source models typically score below 30%. These findings highlight a substantial gap between human and machine map understanding and underscore the need for benchmarks like MapQA to guide future progress in multimodal reasoning.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19423