Difference-aware Visiolinguistic Regularization for Image Change Captioning

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Change Captioning, Image Captioning, Vision-Language Models
TL;DR: We propose DAVIR, which improves MLLM fine-tuning by guiding the encoder to focus on subtle visual changes and enhancing decoder caption generation via entity prompts, achieving state-of-the-art results across multiple benchmarks.
Abstract: Image Change Captioning (ICC) has emerged as an important task in multi-modal generative AI, aiming to generate natural language descriptions that reflect the differences between two similar images. Unlike traditional image captioning, ICC requires strong cross-image difference reasoning and language generation capabilities to handle diverse and complex scenarios. Recent advances have introduced methods based on multimodal large language models (MLLMs) for ICC, achieving impressive results. However, these approaches rely solely on caption-level supervision to implicitly infer and describe changes, which often results in the omission of fine-grained differences and suboptimal caption quality. To address this, we propose a Difference-Aware Visiolinguistic Regularization (DAVIR) paradigm that jointly regularizes the fine-tuning of an MLLM from both visual and linguistic perspectives, enabling better adaptation to ICC. Specifically, we first introduce a fine-grained attention control module that regularizes the final-layer self-attention maps of the MLLM's encoder, guiding it to focus on subtle changes during feature extraction. Second, we propose an entity prompt construction scheme that guides the MLLM's decoder and enhances caption generation quality. Extensive experiments on three benchmark datasets spanning different scenarios demonstrate that our method achieves state-of-the-art performance. The code will be released publicly.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7950