SAR-TEXT: From Imperfect Multimodal Earth Observation to Large-Scale SAR–Language Supervision

Xinjun Cheng; Chunping Qiu; Xichuan Zhang; Qiangjuan Huang; Ke Yang; Yiguo He

Published: 21 May 2026, Last Modified: 01 Jun 2026 | MONTI 2026 Oral | CC BY 4.0

Keywords: SAR, vision-language models, image-text retrieval, multimodal remote sensing, progressive transfer learning

TL;DR: We build large-scale SAR-language supervision and show that progressive transfer improves SAR image-text retrieval, captioning, and downstream VQA under imperfect multimodal Earth observation.

Abstract: Multimodal Earth observation is rarely complete, synchronized, or uniformly informative in practice. Optical imagery may be unavailable or unreliable under cloud cover, nighttime conditions, or adverse weather, whereas synthetic aperture radar (SAR) remains observable but is substantially harder to interpret semantically. This modality gap limits the applicability of modern vision–language models in realistic remote sensing pipelines. In this paper, we study whether large-scale SAR–language supervision can serve as a practical bridge for multimodal representation learning under imperfect observations. We present SAR-TEXT, a 136,584-pair SAR image–text dataset built from heterogeneous SAR sources using a multi-stage caption generation pipeline, including annotation-to-caption conversion, segmentation-guided captioning, and rule-guided rewriting from optical descriptions. We further adopt a progressive transfer strategy that adapts vision–language foundation models from natural images to optical remote sensing and then to SAR. Experiments on cross-modal retrieval, caption generation, and downstream SAR visual question answering show that large-scale SAR–language supervision substantially improves performance over direct-transfer baselines. Human auditing further indicates that the automatically generated captions are generally usable at scale, while failure cases reveal the main bottlenecks under imperfect semantic supervision. Our results suggest that SAR–language alignment is a promising mechanism for robust multimodal remote sensing when observations are heterogeneous, incomplete, weakly paired, or only partially observable.

Submission Number: 12
