RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning in Urban Road Scenarios
Keywords: Multimodal Large Language Models, Spatial Reasoning, Benchmark, Road, Urban
TL;DR: A benchmark to evaluate MLLMs' fine-grained spatial understanding and reasoning capabilities with 6 urban scenario tasks and 9,121 test cases.
Abstract: Multimodal large language models (MLLMs) have demonstrated powerful capabilities in visual-language understanding and reasoning.
However, their fine-grained spatial understanding and reasoning capabilities in complex urban scenarios have received little attention from either research or benchmarking efforts. To fill this gap, we focus on road markings as a representative class of fine-grained spatial elements in urban scenarios, given the essential role they play in forming a city's integrated road traffic network.
Centered on road markings and the urban traffic system, we propose **RoadBench**, a systematic benchmark that comprehensively evaluates MLLMs' fine-grained spatial understanding and reasoning capabilities using bird's-eye view (BEV) and first-person view (FPV) image inputs. The benchmark comprises six tasks with 9,121 strictly manually verified test cases. Together, these tasks form an evaluation framework that bridges understanding at local spatial scopes with reasoning at the global scale. They test not only MLLMs' capabilities in recognition, joint understanding, and reasoning but also their ability to integrate image information with domain knowledge. Evaluating 14 mainstream MLLMs, we confirm that RoadBench is a challenging benchmark and reveal significant shortcomings in existing MLLMs' fine-grained spatial understanding and reasoning within urban scenarios: on certain tasks, their performance falls short of simple rule-based or random-selection baselines. These findings, together with RoadBench itself, can contribute to the comprehensive advancement of MLLMs' spatial understanding capabilities. The benchmark code, example datasets, and raw evaluation results are available at https://anonymous.4open.science/r/RoadBench-A00E.
Primary Area: datasets and benchmarks
Submission Number: 19800