Abstract: Highlights
• Effective representation learning for urban spaces using POI data.
• The first multimodal contrastive learning model to align spatial and semantic information.
• Improved conceptualisation of urban space representations through location encoding.
• Enhanced modelling of POI semantics by pre-trained text encoders.
• Extensive experiments validate the model's superior performance and robustness across different spatial scales and urban contexts.