Cross-Modal Semantic Anchoring: Unsupervised Consistency Verification for Aerial Imagery and Maps via Multimodal LLMs

18 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Semantic Anchoring, Cross-Modal Alignment, Aerial Imagery
Abstract: Up-to-date maps are crucial for urban living, enabling navigation, planning, and decision-making. The increasing accessibility of aerial imagery provides a cost-effective way to update map semantics, particularly the representation of buildings, which reflects ongoing urban renewal through construction and demolition. However, aligning the heterogeneous modalities of aerial images and maps remains challenging due to the significant modality gap. While prior work focuses on low-level visual feature matching, we argue that such methods ignore the semantic correspondence between maps and aerial imagery. We therefore propose U-CSA, an unsupervised cross-modal semantic anchoring framework powered by multimodal large language models (MLLMs). Unlike conventional contrastive pre-training approaches that rely on large paired datasets, U-CSA exploits the world knowledge and cross-modal reasoning capabilities of MLLMs to generate high-level semantic anchors: interpretable descriptions of salient geo-entities and spatial structures. These anchors provide a unified semantic space that guides dual-branch image encoders to align visual features through anchored contrastive learning. The semantically enriched encoders are then incorporated into an adversarial matching network, where dynamically generated sample pairs enable fine-grained discrimination between matched and unmatched regions. Extensive experiments demonstrate that U-CSA outperforms state-of-the-art approaches.
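To make the anchored contrastive learning step concrete, the sketch below shows one plausible formulation under stated assumptions: MLLM-generated anchor descriptions have already been embedded as text vectors, and the dual-branch aerial and map encoders each produce one embedding per region. The function name and tensor layout are illustrative, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of anchored contrastive alignment,
# assuming precomputed MLLM anchor-text embeddings and embeddings from the
# two image branches (aerial, map), all shaped [batch, dim].
import torch
import torch.nn.functional as F

def anchored_contrastive_loss(aerial_emb, map_emb, anchor_emb, temperature=0.07):
    """InfoNCE-style loss pulling both modalities of a region toward the
    embedding of its shared semantic anchor (row i of each tensor is region i)."""
    a = F.normalize(aerial_emb, dim=-1)
    m = F.normalize(map_emb, dim=-1)
    t = F.normalize(anchor_emb, dim=-1)
    targets = torch.arange(a.size(0), device=a.device)  # region i matches anchor i
    loss_a = F.cross_entropy(a @ t.T / temperature, targets)  # aerial -> anchor
    loss_m = F.cross_entropy(m @ t.T / temperature, targets)  # map -> anchor
    return 0.5 * (loss_a + loss_m)
```

Because both branches are contrasted against the same anchor embeddings rather than directly against each other, the anchors act as a unified semantic space bridging the modality gap, which is the role the abstract ascribes to them.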
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11036