Keywords: Scene Change Detection, Vision-Language Models, Multimodal Learning, Urban Monitoring, Visual Place Recognition, Dataset Annotation, Street-View Data, Object-Level Change Detection
Abstract: Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely solely on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose a vision-language framework for scene change detection that overcomes this single-modality bottleneck by incorporating semantic understanding through language. Our approach features a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of detected changes, which are fused with visual features through a feature enhancer. Additionally, we introduce a geometric-semantic matching module that refines the predictions. To enable comprehensive evaluation, we present NYC-CD, a large-scale dataset of 8,122 real-world image pairs from New York City with multiclass change annotations, created through our semi-automatic annotation pipeline. Our method achieves state-of-the-art results across street-view benchmarks through semantic-visual feature integration. Extensive experiments further show that our language module consistently improves existing change detection architectures by substantial margins, highlighting the fundamental value of incorporating linguistic reasoning into visual change detection systems.
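The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of what the described feature enhancer could look like: VLM-generated change-caption embeddings fused with visual features via cross-attention. All module names, dimensions, and the residual-fusion design are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a language-visual feature enhancer (not the paper's
# implementation). Text token embeddings of a VLM-generated change caption
# are fused into visual difference features via cross-attention.
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=512, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # align text to visual dim
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feats, txt_embeds):
        # vis_feats: (B, H*W, vis_dim) flattened visual change features
        # txt_embeds: (B, T, txt_dim) token embeddings of the VLM caption,
        # e.g. "a bench was removed near the curb"
        txt = self.txt_proj(txt_embeds)
        attended, _ = self.cross_attn(query=vis_feats, key=txt, value=txt)
        return self.norm(vis_feats + attended)  # residual fusion

# Usage with random tensors standing in for real features:
enhancer = FeatureEnhancer()
vis = torch.randn(2, 64 * 64, 256)   # assumed 64x64 feature map
txt = torch.randn(2, 16, 512)        # assumed 16 caption tokens
fused = enhancer(vis, txt)           # (2, 4096, 256) language-enhanced features
```

One plausible reason for a residual design like this is that the visual pathway remains intact when the caption is uninformative, so the language module stays modular, as the abstract emphasizes.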
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19959