Region-aware Difference Distilling with Attribute-guided Contrastive Regularization for Change Captioning
Abstract: Change captioning aims to describe the differences between two similar images in natural language, which aids in understanding and monitoring changes. This challenging task requires a fine-grained understanding of subtle changes while remaining robust to disturbances such as viewpoint shifts and illumination variations. Existing methods often rely solely on global difference features and lack comprehensive alignment between linguistic and visual information, so they overlook fine-grained details and generate semantically hallucinated sentences. To address these limitations, we propose the region-aware difference distilling (RDD) network with attribute-guided contrastive regularization (ACR). RDD uses global difference features to progressively distill regional difference features through learnable vectors, allowing more precise identification of changed regions. ACR strengthens the alignment between linguistic and visual information by formulating Nouns-to-Objects (N2O) and Verbs-to-Actions (V2A) alignment losses that regularize the regional difference features. Results on three datasets demonstrate that our method outperforms state-of-the-art change captioning methods.
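The abstract does not give the exact form of the N2O and V2A alignment losses. As a rough illustration of what a contrastive alignment loss of this general kind can look like, the sketch below computes a symmetric InfoNCE loss between paired text-attribute embeddings (for example, noun or verb embeddings) and regional difference features. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made for illustration, not the paper's formulation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(text_embs, region_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical stand-in for an alignment loss).

    The i-th text embedding is treated as the positive match of the i-th
    regional difference feature; all other pairs in the batch are negatives.
    """
    t = [l2_normalize(v) for v in text_embs]
    r = [l2_normalize(v) for v in region_embs]
    n = len(t)
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(t[i], r[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Text-to-region direction: each row should peak on its diagonal entry.
    t2r = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Region-to-text direction: same idea over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    r2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2r + r2t)

# Toy data: two text embeddings and two regional features.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, aligned[::-1])
```

In this toy setup the loss is close to zero when each text embedding matches its own regional feature and large when the pairs are swapped, which is the behavior a regularizer of this kind relies on to pull matching linguistic and visual features together.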