Abstract: Change captioning involves describing the subtle changes between a pair of similar images. Although existing efforts have achieved compelling success, they overlook the potential of multimodal large language models (MLLMs) in tackling this challenging task. In this work, we aim to empower MLLMs with the capability to perceive subtle differences between paired images and thereby enhance their performance in generating change captions. Specifically, we present a diFferentIal-perceptive aNd rEtRieval-augmented MLLM (FINER-MLLM) tailored for this task. In particular, FINER-MLLM leverages the MLLM's image encoder, fine-tuned with LoRA, to extract image patch features and thus capture detailed image information. Within the MLLM's feature extraction module, typically a Q-Former, FINER-MLLM then incorporates dual constraints: an intra-image feature independence constraint and an inter-image feature alignment constraint. These constraints ensure that the extracted features comprehensively cover the subtle visual information within each image and that corresponding features across the two images align effectively. Finally, we introduce retrieval augmentation, which first retrieves a relevant corpus to facilitate the MLLM's decoder, \textit{i.e.}, the LLM, in generating accurate change captions. Extensive experiments on three benchmark datasets, \textit{i.e.}, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the superiority of our proposed method.
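To make the dual constraints concrete, one plausible instantiation is sketched below; the abstract does not specify the exact formulation, so the symbols and losses here are illustrative assumptions rather than the paper's definitions. Let $Q^{b}, Q^{a} \in \mathbb{R}^{N \times D}$ denote the $\ell_2$-normalized query features produced by the Q-Former for the "before" and "after" images, with rows $q^{b}_{i}$ and $q^{a}_{i}$ (and $q_{i}$ denoting the rows of either image when treated separately). Then
\[
\mathcal{L}_{\mathrm{intra}} = \frac{1}{N(N-1)} \sum_{i \neq j} \big( q_{i}^{\top} q_{j} \big)^{2},
\qquad
\mathcal{L}_{\mathrm{inter}} = \frac{1}{N} \sum_{i=1}^{N} \big( 1 - {q^{b}_{i}}^{\top} q^{a}_{i} \big),
\]
where $\mathcal{L}_{\mathrm{intra}}$, applied to each image separately, suppresses correlation between distinct queries so that they capture complementary visual details, and $\mathcal{L}_{\mathrm{inter}}$ pulls corresponding queries of the two images together so that unchanged content maps to aligned features. Under this reading, both terms would be added to the captioning loss with small weights.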
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: In this paper, we focus on the task of change captioning, which involves describing the subtle differences between a pair of similar images in natural language. It is a natural extension of the traditional image captioning task and can be applied in various scenarios, such as video surveillance. Existing multimodal large language models (MLLMs) predominantly focus on understanding single images and struggle to capture the nuances between multiple images. Toward this end, we aim to improve the performance of MLLMs on image change captioning. Specifically, we propose FINER-MLLM, a differential-perceptive and retrieval-augmented MLLM, for this new yet challenging multimodal task. The main idea is to enforce feature independence within each image and feature alignment across the two images during visual feature extraction, and to employ retrieval augmentation to enhance text generation. Our work is closely related to the vision and language subject area and improves the performance of multimedia foundation models on change captioning. We believe this paper can advance the field of change captioning and inspire the multimedia community to extend MLLMs to tasks involving multiple images.
Submission Number: 4037