UrbanGeoEval: A City-Scale Benchmark for Evaluating Large Language Models in Geospatial Reasoning

ACL ARR 2026 January Submission8457 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Geospatial Reasoning, Benchmark, Urban Intelligence, Knowledge Decoupling
Abstract: Current evaluations of geospatial reasoning in LLMs are frequently impeded by the entanglement of factual recall and spatial logic, which obscures models' true capabilities in complex city-scale environments. To address this, we introduce UrbanGeoEval, a comprehensive benchmark built on a dual-module framework designed to disentangle these competencies. The Knowledge Module assesses urban memory via scalable map-based queries, while the Reasoning Module isolates pure logical inference across 3,148 realistic tasks by supplying the necessary geospatial context. Our evaluation methodology introduces a reliable hybrid pipeline that merges deterministic programmatic checks with an LLM-as-a-Judge, achieving expert-level evaluation accuracy. Extensive experiments on 16 widely used LLMs uncover critical insights: (1) models exhibit severe geographic biases and "resolution gaps"; (2) failures on complex multi-hop tasks often stem from brittle foundational spatial skills (e.g., topology and arithmetic) rather than deficits in high-level logic. UrbanGeoEval provides a precise diagnostic tool for advancing urban geospatial intelligence in LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Question Answering, Interpretability and Analysis of Models for NLP, NLP Applications
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 8457