Abstract: Highlights•Representing regional features by capturing associations among street-level images.•The proposed Vision-LSTM extract features from varying numbers of images.•Multimodal model fused satellite imagery, street-level imagery, and mobility data.•Both visual and dynamic mobility information crucial for urban village recognition.•The framework achieved 91.6% accuracy in identifying urban villages.
Loading