Abstract: Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs a Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments, and texts and yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap. Our code is publicly accessible, and the dataset will be made available at https://anonymous.4open.science/r/UrbanCross/.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: UrbanCross advances multimedia/multimodal processing by introducing a framework that improves the adaptability of satellite image-text retrieval across different domains, a previously unexplored area. Past research has primarily focused on enhancing retrieval performance without considering the variations in data distribution among different cities or countries. Our study directly tackles this gap by incorporating a domain adaptation module that handles diverse urban satellite imagery from multiple sources along with its textual descriptions. Additionally, we use a Large Multimodal Model and the Segment Anything Model for data augmentation, enhancing both the textual and visual aspects. To support our experiments, we have curated a cross-country dataset with varied geo-tags, which serves as a benchmark for future research. Extensive evaluations demonstrate that UrbanCross not only enhances the satellite image-text retrieval task but also improves model adaptability to new domains and environments. Beyond its technical innovations, the framework offers fresh insights for processing and analyzing multimodal content, aiming to deepen the understanding of urban geographic semantics using satellite images and text. Consequently, it can further facilitate various urban tasks, such as urban region profiling and cross-city spatio-temporal model transfer related to environmental, social, and economic aspects.
Supplementary Material: zip
Submission Number: 1213