DiffTell: A Comprehensive Dataset for Image Difference Captioning

26 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0
Keywords: Image Difference Caption, Vision Language Task, A Comprehensive Dataset
TL;DR: A comprehensive, large-scale, high-quality dataset for image difference captioning
Abstract: The Image Difference Captioning (IDC) task is to describe the differences between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce a more extensive dataset, DiffTell, which encompasses various types of differences between images, including global image alterations, object-level changes, and text manipulations. DiffTell includes both newly collected data and filtered data used in previous studies. Additionally, to scale up data collection without prohibitive human labor costs, we explore the possibility of automatic filtering for quality control. We show that both traditional methods and recent multimodal large language models (MLLMs) perform better on the IDC task after training on the DiffTell dataset. We conduct extensive ablation studies to analyze the performance gain from DiffTell. Experiments indicate that DiffTell significantly expands the resources available for IDC research, offering a more comprehensive foundation and benchmark for future investigations.
Primary Area: datasets and benchmarks
Submission Number: 7912