Keywords: vision-language models, remote sensing change detection, confidence-aware fusion, trustworthy multi-modal learning
TL;DR: Siamese Temporal Adapter and Reliability-Aware Fusion for Bi-Temporal Remote Sensing Change Reasoning
Abstract: Bi-temporal remote sensing change detection is essential for urban monitoring, disaster response, and infrastructure assessment.
However, existing change detection models often rely on task-specific dense supervision and are highly sensitive to temporal misalignment, background clutter, and cross-domain distribution shifts.
To address these limitations, we propose \textbf{TMCD-RS}, a lightweight vision--language framework that reformulates bi-temporal change detection as text-guided structural change reasoning.
Built upon a frozen CLIP backbone, TMCD-RS adopts a shared-weight Siamese visual encoder to process pre-event and post-event images jointly.
Instead of relying solely on raw image differencing, the proposed method performs temporal reasoning in the feature space by combining absolute bi-temporal representations with residual change evidence derived from their feature discrepancies.
To improve robustness under imperfect temporal correspondence, we introduce a \textbf{reliability-aware temporal fusion} module that predicts a confidence score from the global temporal discrepancy and uses it to adaptively modulate both image-level and pixel-level fusion.
In parallel, a learnable multi-scale fusion module aggregates text-guided anomaly maps from multiple intermediate layers, enabling fine-grained localization of changed structures.
TMCD-RS follows a two-stage training strategy.
In Stage~1, lightweight text adapters are optimized to learn disentangled normal and abnormal textual prototypes for change-aware semantic alignment.
In Stage~2, the text branch is fixed, while image adapters and temporal fusion modules are optimized to capture domain-invariant structural change cues from paired observations.
This design preserves the open-vocabulary generalization of CLIP while introducing only a small number of trainable parameters.
Experiments on remote sensing building-change benchmarks demonstrate that TMCD-RS achieves strong localization and image-level discrimination, while showing improved robustness to pre-existing structures and better transferability across datasets.
These results suggest that confidence-aware bi-temporal feature fusion provides an effective and practical direction for trustworthy remote sensing change reasoning.
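The reliability-aware temporal fusion described in the abstract can be illustrated with a minimal sketch: a confidence score is derived from the global discrepancy between pre-event and post-event features and used to gate the residual change evidence before fusion. All function and variable names, the sigmoid gating form, and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reliability_aware_fusion(f_pre, f_post, w=1.0, b=0.0):
    """Hypothetical sketch of confidence-gated bi-temporal fusion.

    f_pre, f_post: (C, H, W) feature maps from a shared-weight encoder.
    Combines absolute bi-temporal representations with residual change
    evidence, down-weighting the residual when the global temporal
    discrepancy is large (suggesting misalignment rather than change).
    """
    diff = f_post - f_pre                      # residual change evidence
    global_disc = np.abs(diff).mean()          # global temporal discrepancy
    conf = sigmoid(-(w * global_disc + b))     # large discrepancy -> low confidence
    fused = np.concatenate([f_pre, f_post, conf * diff], axis=0)
    return fused, conf

# Example with toy features: identical inputs yield near-zero residual
# and high confidence; strongly shifted inputs yield lower confidence.
f0 = np.random.rand(8, 16, 16)
fused_same, conf_same = reliability_aware_fusion(f0, f0.copy())
fused_shift, conf_shift = reliability_aware_fusion(f0, f0 + 5.0)
```

In this sketch the confidence acts as a scalar image-level gate; the paper's module additionally modulates pixel-level fusion, which would replace the scalar `conf` with a spatial confidence map.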
Submission Number: 6