TL;DR: Benchmark for VLM referring expression comprehension over maps, including basic map element understanding and geospatial reasoning.
Abstract: Referring expression comprehension (REC) localizes or segments image regions described by natural language. Despite substantial progress on natural-image benchmarks, it remains unclear how well vision-language models (VLMs) transfer to structured, geospatial, and map-based imagery. In this work, we introduce MapRef, a large-scale REC benchmark for map understanding covering diverse map modalities and spatial scales, including general geographic maps, weather visualizations, land-cover layers, and other domain-specific map types. Evaluating three state-of-the-art VLMs, we find a pronounced performance drop from natural-image REC to map-based REC, highlighting significant challenges and opportunities in spatial reasoning and cross-modality grounding for maps.
Submission Number: 4