Keywords: vision-language models, remote sensing change detection, confidence-aware fusion, trustworthy multi-modal learning
TL;DR: Siamese Temporal Adapter and Reliability-Aware Fusion for Bi-Temporal Remote Sensing Change Reasoning
Abstract: Bi-temporal remote sensing change detection is essential for urban monitoring, disaster response, and infrastructure assessment.
However, existing change detection models often rely on task-specific dense supervision and are highly sensitive to temporal misalignment, background clutter, and cross-domain distribution shifts.
To address these limitations, we propose \textbf{TMCD-RS}, a lightweight vision--language framework that reformulates bi-temporal change detection as text-guided structural change reasoning.
Built upon a frozen CLIP backbone, TMCD-RS adopts a shared-weight Siamese visual encoder to process pre-event and post-event images jointly.
Instead of relying solely on raw image differencing, the proposed method performs temporal reasoning in the feature space by combining absolute bi-temporal representations with residual change evidence derived from their feature discrepancies.
To improve robustness under imperfect temporal correspondence, we introduce a \textbf{reliability-aware temporal fusion} module that predicts a confidence score from the global temporal discrepancy and uses it to adaptively modulate both image-level and pixel-level fusion.
In parallel, a learnable multi-scale fusion module aggregates text-guided anomaly maps from multiple intermediate layers, enabling fine-grained localization of changed structures.
TMCD-RS follows a two-stage training strategy.
In Stage~1, lightweight text adapters are optimized to learn disentangled normal and abnormal textual prototypes for change-aware semantic alignment.
In Stage~2, the text branch is fixed, while image adapters and temporal fusion modules are optimized to capture domain-invariant structural change cues from paired observations.
This design preserves the open-vocabulary generalization of CLIP while introducing only a small number of trainable parameters.
Experiments on remote sensing building-change benchmarks demonstrate that TMCD-RS achieves strong localization and image-level discrimination, while showing improved robustness to pre-existing structures and better transferability across datasets.
These results suggest that confidence-aware bi-temporal feature fusion provides an effective and practical direction for trustworthy remote sensing change reasoning.
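The reliability-aware temporal fusion described in the abstract can be illustrated with a minimal sketch: a confidence score is derived from the global discrepancy between pre-event and post-event features and used to gate the residual change evidence before fusion. All function and variable names, the sigmoid gating form, and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reliability_aware_fusion(f_pre, f_post, w=1.0, b=0.0):
    """Hypothetical sketch of confidence-gated bi-temporal fusion.

    f_pre, f_post: (C, H, W) feature maps from a shared-weight encoder.
    Combines absolute bi-temporal representations with residual change
    evidence, down-weighting the residual when the global temporal
    discrepancy is large (suggesting misalignment rather than change).
    """
    diff = f_post - f_pre                      # residual change evidence
    global_disc = np.abs(diff).mean()          # global temporal discrepancy
    conf = sigmoid(-(w * global_disc + b))     # large discrepancy -> low confidence
    fused = np.concatenate([f_pre, f_post, conf * diff], axis=0)
    return fused, conf

# Example with toy features: identical inputs yield near-zero residual
# and high confidence; strongly shifted inputs yield lower confidence.
f0 = np.random.rand(8, 16, 16)
fused_same, conf_same = reliability_aware_fusion(f0, f0.copy())
fused_shift, conf_shift = reliability_aware_fusion(f0, f0 + 5.0)
```

In this sketch the confidence acts as a scalar image-level gate; the paper's module additionally modulates pixel-level fusion, which would replace the scalar `conf` with a spatial confidence map.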
Submission Number: 6