Keywords: Forgery Localization, Forgery Interpretation, Deepfake, Multimodal Learning, Large Language Models
TL;DR: We built ForgeryTalker, a model trained on our new image-text dataset (MMTT), to both locate tampered regions in an image and explain in text why they look fake.
Abstract: Existing facial forgery detection methods typically focus on binary classification or pixel-level localization, providing little semantic insight into the nature of the manipulation. To address this, we introduce Forgery Attribution Report Generation, a new multimodal task that jointly localizes forged regions ("Where") and generates natural language explanations grounded in the editing process ("Why"). This dual-focus approach goes beyond traditional forensics, providing a more comprehensive understanding of the manipulation. To enable research in this domain, we present **M**ulti-**M**odal **T**amper **T**racing (**MMTT**), a large-scale dataset of 152,217 samples, each paired with a process-derived ground-truth mask and a human-authored textual description, ensuring high annotation precision and linguistic richness. We further propose ForgeryTalker, a unified end-to-end framework that integrates vision and language via a shared encoder (image encoder + Q-Former) and dual decoders for mask and text generation, enabling coherent cross-modal reasoning. Experiments show that ForgeryTalker achieves competitive performance on both the report generation and forgery localization subtasks, reaching 59.3 CIDEr and 73.67 IoU, respectively, and establishing a baseline for explainable multimedia forensics. Dataset and code will be released to foster future research.
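To make the shared-encoder / dual-decoder idea concrete, here is a minimal PyTorch-style sketch. Every module choice below (backbone depth, query count, head sizes, `DualDecoderForensics` itself) is an illustrative assumption, not the authors' actual ForgeryTalker implementation; a real system would route the query tokens into an LLM for report generation rather than a single projection layer.

```python
# Minimal sketch: shared vision encoder + Q-Former-like query module,
# with one decoder head for the mask ("Where") and one for text ("Why").
# Hyperparameters and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

class DualDecoderForensics(nn.Module):
    def __init__(self, vis_dim=768, num_queries=32, vocab_size=30522):
        super().__init__()
        # Shared image encoder over patch features.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Learnable queries distilled from visual features, Q-Former style.
        self.queries = nn.Parameter(torch.randn(1, num_queries, vis_dim))
        self.qformer = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=vis_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Decoder 1: per-patch forgery score, later upsampled to a pixel mask.
        self.mask_head = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        # Decoder 2: token logits per query; stands in for the language decoder.
        self.text_head = nn.Linear(vis_dim, vocab_size)

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, vis_dim) from a patch embedder.
        vis = self.image_encoder(patch_feats)
        q = self.qformer(self.queries.expand(vis.size(0), -1, -1), vis)
        mask_logits = self.mask_head(vis).squeeze(-1)  # (B, N_patches)
        text_logits = self.text_head(q)                # (B, num_queries, vocab_size)
        return mask_logits, text_logits
```

Training such a model end-to-end would combine a segmentation loss on the mask logits with a language-modeling loss on the text side, so that both decoders share the same grounded visual representation.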
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1894