Keywords: Multi-modal large language model, Cross-view learning, Urban understanding
TL;DR: We propose UrbanMLLM, a versatile model with comprehensive urban understanding capabilities encompassing both remote sensing and street-view data domains.
Abstract: Multimodal large language models (MLLMs) have exhibited remarkable capabilities for performing complex vision-language tasks in various domains.
Currently, MLLMs for urban studies are developed almost exclusively around remote sensing imagery.
However, beyond the macroscopic information provided by remote sensing imagery, effective urban understanding also requires the detailed appearance information of urban zones captured by street-view imagery, which is largely overlooked by existing MLLMs.
The primary challenges of developing such a versatile urban MLLM are twofold.
Firstly, it needs a large-scale corpus with well-organized, cross-view urban imagery paired with corresponding text for cross-modal training.
Secondly, traditional MLLMs typically learn from image-text pairs independently, making it hard for them to support joint modeling of cross-view urban imagery.
To address these challenges, in this work, we propose UrbanMLLM, a novel MLLM that jointly learns from remote sensing and street-view imagery to harness their complementary information.
We first collect a large-scale dataset containing satellite-view and street-view imagery along with their geotags and annotated texts.
Technically, we propose a new MLLM architecture with a cross-view perceiver that explicitly connects the visual information of cross-view urban imagery.
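For concreteness, a minimal sketch of what such a cross-view perceiver could look like is given below, assuming a perceiver-style module in which learnable latent queries cross-attend to satellite-view and street-view tokens; the module name, shapes, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of a perceiver-style cross-view connector (assumed design).
import torch
import torch.nn as nn

class CrossViewPerceiver(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        # Learnable latent queries that summarize cross-view visual evidence.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Cross-attention from the latents to the concatenated visual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, sat_tokens, street_tokens):
        # sat_tokens:    (B, N_sat, dim)  satellite-view patch features
        # street_tokens: (B, N_str, dim)  street-view patch features
        kv = torch.cat([sat_tokens, street_tokens], dim=1)
        q = self.latents.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)
        # The fused latents would then be passed to the LLM as visual tokens.
        return fused + self.ffn(fused)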
We also introduce a novel pre-training paradigm based on structured, interleaved urban image-text documents that integrate satellite-view imagery, street-view imagery, and related textual descriptions.
This approach encourages the model to implicitly learn the relationships between different types of urban imagery, enhancing its understanding of each domain.
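To illustrate, one such interleaved document might be organized as sketched below; the abstract does not specify the schema, so the field names and values are hypothetical.

# Illustrative sketch of one interleaved urban image-text document (assumed schema).
document = {
    "geotag": {"lat": 40.7128, "lon": -74.0060},
    "content": [
        {"type": "image", "view": "satellite", "path": "..."},
        {"type": "text", "value": "A dense commercial block with mixed land use ..."},
        {"type": "image", "view": "street", "path": "..."},
        {"type": "image", "view": "street", "path": "..."},
        {"type": "text", "value": "Storefronts and wide sidewalks along the main road ..."},
    ],
}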
We evaluate our model on a comprehensive benchmark comprising 13 diverse urban understanding tasks across satellite-view, street-view, and cross-view domains. These tasks include scene classification, object reasoning, spatial relationship reasoning, geo-localization, landmark reasoning, and indicator prediction, providing a robust assessment of the model's capabilities.
Extensive experiments demonstrate that UrbanMLLM achieves average performance improvements of 27.3% and 25.5% over the best open-source and closed-source MLLMs, respectively.
Moreover, we thoroughly study the impact of different pre-training data choices and model scales on performance, offering practical insights for effective MLLM design. The proposed UrbanMLLM offers a scalable and versatile solution for understanding urban environments.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4194