A Large-scale Interpretable Multi-modality Benchmark for Image Forgery Localization

22 Sept 2024 (modified: 15 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Image Forgery Localization, Forgery Detection, Semantic Segmentation, Deepfake Detection, Multimodal Learning, Explainable AI, Salient Region Detection, Image-Text Pair Dataset, Interpretable Machine Learning, Large Language Models (LLMs)
TL;DR: We propose a framework for image forgery localization that focuses on salient regions and improves interpretability using the MMTT dataset and ForgeryTalker model, outperforming traditional binary mask methods.
Abstract: Image forgery localization, which centers on identifying tampered pixels within an image, has seen significant advancements. Traditional approaches often model this challenge as a variant of image segmentation, treating the segmentation of forged areas as the end product. However, while semantic segmentation yields distinct regions with clear semantics that humans can readily interpret, interpreting detected forgery regions is less straightforward and remains an underexplored problem. We argue that the simplistic binary forgery mask, which merely delineates tampered pixels, fails to provide adequate information for explaining the model's predictions. First, the mask does not elucidate the rationale behind the model's localization. Second, the mask treats all forged pixels uniformly, so it cannot emphasize the most conspicuously unreal regions, which ultimately hinders human discernment of the most anomalous areas. In this study, we mitigate these limitations by generating a salient-region-focused interpretation for forged images, articulating the rationale behind the predicted forgery mask and highlighting the pivotal forgery regions with an interpretive description. To support this, we craft a **M**ulti-**M**odal **T**amper **T**racing (**MMTT**) dataset, comprising images manipulated using deepfake techniques and paired with manual, interpretable textual annotations. To obtain high-quality annotations, annotators are instructed to observe the manipulated images meticulously and articulate the typical characteristics of the forgery regions. In total, we collect 128,303 image-text pairs. Leveraging the MMTT dataset, we develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation. ForgeryTalker first trains a forgery prompter network to identify the pivotal clues within the explanatory text. The prompter is then incorporated into a multimodal large language model, which is fine-tuned to achieve the dual goals of localization and interpretation. Extensive experiments conducted on the MMTT dataset verify the superior performance of our proposed model.
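As a rough illustration of the setting the abstract describes, the sketch below pairs a binary forgery mask with a free-text explanation and scores a predicted mask against it. The `ForgerySample` record, its field names, the example description, and the use of intersection-over-union (IoU) are all assumptions made for illustration; the abstract specifies neither the MMTT schema nor the evaluation metrics.

```python
from dataclasses import dataclass
from typing import List

# A binary mask as nested lists: 1 = tampered pixel, 0 = authentic pixel.
Mask = List[List[int]]


@dataclass
class ForgerySample:
    """Hypothetical record layout; the real MMTT schema is not given in the abstract."""
    image_path: str   # placeholder path, not a real dataset file
    mask: Mask        # ground-truth binary forgery mask
    description: str  # annotator-written explanation of the forgery


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


sample = ForgerySample(
    image_path="example.png",
    mask=[[0, 1, 1],
          [0, 1, 0],
          [0, 0, 0]],
    description="Invented example text: the edges of the mouth region look blurred.",
)

predicted_mask = [[0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 0]]

score = mask_iou(predicted_mask, sample.mask)
print(f"IoU = {score:.3f}")
```

Note that the mask alone gives every tampered pixel the same weight and says nothing about why a region was flagged; in this sketch only the `description` field carries that information, which is the gap the abstract sets out to address.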
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2613