Learning to Describe Urban Change: Graph-Guided Detection and Spatio-Temporal State Space Model with Uncertainty Estimation

ICLR 2026 Conference Submission16463 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Change Detection, Change Captioning, State Space Model, Uncertainty Estimation, Urban Development Monitoring, Deep Learning

TL;DR: We propose SemanticGraphCD for robust change detection with an SSM-based captioning model, enhanced by Semantic-Weighted Sentence Entropy (SWSE), which estimates uncertainty in satellite image change captioning for urban monitoring.

Abstract: Automated change detection (CD) and captioning from satellite imagery play a crucial role in urban development monitoring, infrastructure assessment, and land-use analysis. However, existing change captioning systems lack uncertainty quantification, which makes it difficult to assess prediction reliability when analysing critical infrastructure changes, building construction, or environmental modifications, where inaccurate interpretations could affect urban planning decisions or infrastructure management. We address this limitation with a pipeline that combines a SemanticGraphCD module for enhanced change detection with a State Space Model (SSM)-based captioning module for scalable description generation. SemanticGraphCD integrates graph neural networks (GNNs) with task-agnostic semantic learning, employing an adaptive processing mechanism that dynamically switches between GNN-based feature propagation and convolutional operations. This architecture learns semantic representations through bi-temporal consistency constraints, better discriminating meaningful infrastructure and land-use changes from temporal variations in very high-resolution imagery. The SSM-based captioning module contains a Spatial Difference-aware SSM (SD-SSM), which improves on previous CNN- and Transformer-based models in receptive field. In addition, a Temporal Traversing SSM (TT-SSM) scans bi-temporal features in a temporal cross-wise manner, enhancing the model's temporal understanding and information interaction. The captioning module is guided by SemanticGraphCD's change masks through a convolutional focusing module, which aggregates change information from the masks with the bi-temporal images. This helps the model represent the changes between the bi-temporal images within the SSM hidden states, enabling linear computational scaling while maintaining competitive performance.
Rather than treating all caption tokens equally in the context of change detection, we introduce Semantic-Weighted Sentence Entropy (SWSE) for principled uncertainty quantification. SWSE emphasizes domain-relevant vocabulary over function words, providing interpretable confidence measures that correlate with caption quality. Experimental results demonstrate that our approach improves captioning performance over existing state space models, while SWSE provides reliable uncertainty estimates for informed decision-making in urban monitoring applications.
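
To make the idea of weighting tokens unequally concrete, the following is a minimal sketch of one way a semantic-weighted sentence entropy could be computed. It is not the submission's formulation, which the abstract does not specify. It assumes that the decoder exposes a next-token probability distribution at each step, that function words are identified by a small stop-word list, and that the sentence score is a weighted mean of per-token Shannon entropies. The names `FUNCTION_WORDS`, `token_entropy`, `swse`, `content_weight`, and `function_weight` are all hypothetical.

```python
import math

# Hypothetical stop-word list; the abstract does not say how function
# words are distinguished from domain-relevant vocabulary.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are",
                  "has", "have", "been", "to", "and", "with"}


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def swse(tokens, distributions, content_weight=1.0, function_weight=0.1):
    """Weighted mean of per-token entropies (illustrative assumption).

    tokens        -- generated caption tokens, one per decoding step
    distributions -- next-token probability distribution at each step
    Content words receive `content_weight`; function words receive the
    smaller `function_weight`, so they contribute less to the score.
    """
    if len(tokens) != len(distributions):
        raise ValueError("tokens and distributions must have equal length")
    weights = [function_weight if t.lower() in FUNCTION_WORDS else content_weight
               for t in tokens]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    weighted = sum(w * token_entropy(d) for w, d in zip(weights, distributions))
    return weighted / total


# Toy example: the model is unsure only about the function word "a".
caption = ["a", "building", "appears"]
dists = [[0.5, 0.5], [1.0], [1.0]]
score = swse(caption, dists)
uniform_score = swse(caption, dists, function_weight=1.0)
```

In this toy case the uncertainty sits on a function word, so the weighted score is lower than the uniformly weighted one; uncertainty on a word such as "building" would instead dominate the score.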

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 16463