UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · WWW 2024 · CC BY-SA 4.0
Abstract: Urban region profiling from web-sourced data is of utmost importance for urban computing. We are witnessing a blossoming of LLMs across various fields, especially in multi-modal research such as vision-language learning, where the text modality serves as a complement to images. Since the textual modality has rarely been incorporated into modality combinations for urban region profiling, we aim to answer two fundamental questions: i) Can the text modality enhance urban region profiling? ii) If so, in what ways and with regard to which aspects? To answer these questions, we leverage the power of Large Language Models (LLMs) and introduce the first LLM-enhanced framework that integrates knowledge from the text modality into urban imagery, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image using an image-to-text LLM. The model is then trained on image-text pairs, seamlessly unifying language supervision with urban visual representation learning via a joint contrastive loss and language modeling loss. Results on urban indicator prediction in four major metropolises show its superior performance, with an average improvement of 6.1% in R² over state-of-the-art methods. Our code and dataset are available at https://github.com/StupidBuluchacha/UrbanCLIP.
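The abstract describes training on image-text pairs with a combined contrastive loss and language modeling loss. The snippet below is a minimal, self-contained sketch of such a joint objective, not the authors' implementation: the encoder/decoder modules, dimensions, class name `UrbanCLIPSketch`, and the `lambda_lm` weight are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a joint training objective:
# an image-text contrastive loss plus a language modeling (captioning) loss
# over LLM-generated descriptions of satellite images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UrbanCLIPSketch(nn.Module):
    def __init__(self, vocab_size=32000, embed_dim=256):
        super().__init__()
        # Stand-ins for the satellite-image encoder and text encoder/decoder.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_embedding = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.caption_head = nn.Linear(embed_dim, vocab_size)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style

    def forward(self, images, token_ids):
        # images: (B, C, H, W); token_ids: (B, T) tokens of the generated description
        img_emb = F.normalize(self.image_encoder(images), dim=-1)          # (B, D)
        txt_states, _ = self.text_encoder(self.text_embedding(token_ids))  # (B, T, D)
        txt_emb = F.normalize(txt_states[:, -1], dim=-1)                   # (B, D)

        # Contrastive (InfoNCE) loss: matched image-text pairs are positives.
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()            # (B, B)
        labels = torch.arange(images.size(0), device=images.device)
        loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                          F.cross_entropy(logits.t(), labels))

        # Language modeling loss: predict the next token of the description.
        lm_logits = self.caption_head(txt_states[:, :-1])                  # (B, T-1, V)
        loss_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                  token_ids[:, 1:].reshape(-1))

        lambda_lm = 1.0  # assumed weighting between the two objectives
        return loss_con + lambda_lm * loss_lm

# Usage with random data:
model = UrbanCLIPSketch()
loss = model(torch.randn(4, 3, 64, 64), torch.randint(0, 32000, (4, 16)))
loss.backward()
```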