MapQA: A Map-Question-Answering Benchmark for Visual Language Model Reasoning

Published: 02 Mar 2026, Last Modified: 14 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Benchmarks, Datasets, Vision Language Model Benchmarks, Multimodal Model Benchmarks, Map Question Answering
TL;DR: We present a novel benchmark for vision language model reasoning over maps, with questions that humans can answer reliably but that the best VLMs struggle with.
Abstract: Maps are central to how humans make sense of the world, from navigation and environmental monitoring to military planning and historical interpretation. Yet despite rapid progress in large multimodal models (LMMs), these systems continue to struggle with interpreting maps, an essential skill for visual reasoning that goes beyond pattern recognition and text extraction. To further progress toward this capability, we introduce MapQA, the first large-scale benchmark specifically designed to evaluate LMMs on map understanding. MapQA contains over 4,200 carefully curated, open-ended question–answer pairs spanning diverse map types, each constructed to require reasoning directly from the map rather than relying on memorized world knowledge. Benchmark questions were generated through a scalable human-in-the-loop process to ensure quality, and evaluated using an LLM-as-a-judge protocol aligned with human judgments. Our experiments show that while humans answer over 91% of questions correctly, state-of-the-art proprietary models achieve barely half that performance, with open-source models typically below 30%. These findings highlight a substantial gap between human and machine map understanding, underscoring the need for benchmarks like MapQA to guide future progress in multimodal reasoning.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 48