UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web

Yibo Yan; Haomin Wen; Siru Zhong; Wei Chen; Haodong Chen; Qingsong Wen; Roger Zimmermann; Yuxuan Liang

Published: 23 Jan 2024, Last Modified: 23 May 2024, Venue: TheWebConf24 Oral

Keywords: urban computing, multimodal learning, satellite imagery, web-sourced data, contrastive learning

Abstract: Accurately profiling urban regions in terms of social, economic and environmental indicators is crucial for urban planning and sustainable development. The ubiquitous urban imagery from the Web, particularly satellite images, has become a key source for inferring urban indicators. Recent studies in urban imagery representation learning have two limitations: (1) most methods are supervised by a specific urban downstream task and struggle to generalize across diverse urban prediction tasks, which limits their robustness and effectiveness; (2) prevalent unsupervised learning approaches for satellite imagery only take spatial attributes (e.g., Points-of-Interest and mobility) into account as supplemental information, and rarely leverage the semantically rich and explainable textual modality for comprehensive urban region profiling. To address these limitations, this paper introduces a simple yet effective learning framework, Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). UrbanCLIP is trained on image-text pairs and unifies natural language supervision for visual representation learning by jointly optimizing a contrastive loss and a language modeling loss. Applying the learnt satellite visual representations to predict urban indicators in four major Chinese metropolises yields superior performance, with an average improvement of 6.1% in R^2 over the best baseline. This paper sheds light on web mining from satellite imagery and location descriptions for urban region profiling.
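
The abstract says the framework trains on image-text pairs with a contrastive loss and a language modeling loss jointly. The following is a minimal NumPy sketch of what such a joint objective could look like, not the authors' implementation. The symmetric InfoNCE form, the temperature of 0.07, the equal loss weights, and all function and parameter names (`contrastive_loss`, `language_modeling_loss`, `joint_loss`, `w_con`, `w_lm`) are assumptions made for illustration only.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a pair.
    The temperature value is an assumed default, not taken from the paper.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def language_modeling_loss(token_logits, target_ids):
    """Mean token-level cross-entropy for text generation.

    token_logits: (batch, seq_len, vocab) decoder outputs.
    target_ids:   (batch, seq_len) integer ids of the reference tokens.
    """
    log_probs = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(log_probs, target_ids[..., None], axis=-1)
    return -picked.mean()


def joint_loss(img_emb, txt_emb, token_logits, target_ids, w_con=1.0, w_lm=1.0):
    """Weighted sum of the two losses; the weights are hypothetical."""
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_lm * language_modeling_loss(token_logits, target_ids))
```

In this sketch the contrastive term pulls each satellite-image embedding toward the embedding of its own description and away from the other descriptions in the batch, while the language modeling term scores how well the description tokens are predicted. How the actual framework weights and wires the two terms is not stated in the abstract.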

Track: Web Mining and Content Analysis

Submission Guidelines Scope: Yes

Submission Guidelines Blind: Yes

Submission Guidelines Format: Yes

Submission Guidelines Limit: Yes

Submission Guidelines Authorship: Yes

Student Author: No

Submission Number: 379
