Keywords: geospatial ai, referring expression comprehension, vision language models, vision language benchmark, computer vision
Abstract: Referring expression comprehension (REC) aims to localize or segment visual regions described by natural language expressions. While extensive REC benchmarks exist in natural image domains, there remains a gap in understanding how vision--language models (VLMs) generalize to structured, geospatial, and map-based imagery.
In this work, we present \textbf{MapRef}, the first large-scale REC benchmark for geospatial map understanding, spanning a broad spectrum of map modalities: generic cartographic maps, weather visualizations, agricultural and land-cover layers, and other domain-specific map types. Using publicly available raster and vector geospatial data (e.g., NOAA, ESA, OSM), we programmatically generate image--text--mask triplets that cover diverse projections, spatial scales, and reasoning types (a minimal generation sketch follows the abstract).
Our evaluation across a suite of three recent state-of-the-art vision--language models reveals a significant performance gap between REC on natural images and REC on maps: at an IoU threshold of 0.10, GPT-5 achieves only 31.5\% accuracy on US county-level administrative boundaries and 20.7\% accuracy on country-level administrative boundaries over global maps (the metric is sketched after the abstract).
\textbf{MapRef} serves as a foundational benchmark for spatial reasoning, cross-modal grounding, and geospatial learning.
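The triplet-generation pipeline is only described at a high level in the abstract. Below is a minimal Python sketch, under stated assumptions, of how a referring expression and binary region mask could be rasterized from public vector data; the helper `make_triplet`, the column `NAME`, and the file `counties.shp` are hypothetical illustrations, not MapRef's actual pipeline, and the map image itself would come from a separate cartographic renderer (omitted here).

```python
# Hypothetical sketch of image--text--mask triplet generation from vector
# geodata. Names below are illustrative, not MapRef's actual code.
import geopandas as gpd
import numpy as np
from rasterio import features
from rasterio.transform import from_bounds

def make_triplet(gdf: gpd.GeoDataFrame, name_col: str, target: str,
                 size: int = 512):
    """Return a referring expression and a binary mask for one named region,
    rendered on a grid covering the full extent of the input layer."""
    region = gdf[gdf[name_col] == target]
    if region.empty:
        raise ValueError(f"no region named {target!r}")
    # Affine transform mapping the layer's bounding box onto a size x size grid.
    minx, miny, maxx, maxy = gdf.total_bounds
    transform = from_bounds(minx, miny, maxx, maxy, size, size)
    # Burn the target region's geometry into a binary mask.
    mask = features.rasterize(
        ((geom, 1) for geom in region.geometry),
        out_shape=(size, size), transform=transform, dtype=np.uint8)
    expression = f"the administrative boundary of {target}"
    return expression, mask

# Usage (assumes a shapefile of US counties with a NAME column):
# counties = gpd.read_file("counties.shp")
# text, mask = make_triplet(counties, "NAME", "Cook")
```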
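The reported accuracies are thresholded IoU scores. The following is a minimal sketch of the assumed metric, where a prediction counts as correct when its mask IoU with the ground truth meets the threshold (0.10 in the abstract); the function names are illustrative.

```python
# Sketch of accuracy-at-IoU over binary segmentation masks (assumed
# formulation; function names are illustrative).
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def accuracy_at_iou(preds, gts, threshold: float = 0.10) -> float:
    """Fraction of examples whose predicted mask meets the IoU threshold."""
    hits = [mask_iou(p, g) >= threshold for p, g in zip(preds, gts)]
    return float(np.mean(hits))
```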
Submission Number: 20