Keywords: Vision Language Models, Image Forensics, AIGC Detection
TL;DR: This work fine-tunes a Vision Language Model on human-annotated data to classify AI-generated images and to pinpoint where the telltale artifacts are and explain why the image is judged to be synthetic.
Abstract: The rapid rise of image generation calls for detection methods that are both interpretable and reliable. Existing approaches, though accurate, act as black boxes and fail to generalize to out-of-distribution data, while multi-modal large language models (MLLMs) provide reasoning ability but often hallucinate.
To address these issues, we construct \textbf{FakeXplained}, a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, forming the basis for human-aligned, visually grounded reasoning. Leveraging \textbf{FakeXplained}, we develop \textbf{FakeXplainer}, which fine-tunes MLLMs with a progressive training pipeline, enabling accurate detection, artifact localization, and coherent textual explanations. Extensive experiments show that \textbf{FakeXplainer} not only sets a new state of the art in detection and localization accuracy ($98.2\%$ accuracy, $36.0\%$ IoU), but also demonstrates strong robustness and out-of-distribution generalization, uniquely delivering spatially grounded, human-aligned rationales. The code and dataset are available at: \href{https://github.com/Gennadiyev/FakeXplain}{https://github.com/Gennadiyev/FakeXplain}.
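For readers unfamiliar with the localization metric quoted above, the following is a minimal illustrative sketch of how bounding-box IoU between predicted and human-annotated artifact regions could be computed. The helper names (box_iou, localization_score) and the greedy best-match aggregation are assumptions for illustration; the paper's exact evaluation protocol may differ.

# Illustrative sketch only (not the released code): scoring artifact
# localization with IoU, assuming axis-aligned boxes in (x1, y1, x2, y2) form.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_score(pred_boxes, gt_boxes):
    """Best-match IoU averaged over annotated boxes (one plausible aggregation;
    a hypothetical choice, not necessarily the paper's protocol)."""
    if not gt_boxes:
        return 0.0
    return sum(max((box_iou(p, g) for p in pred_boxes), default=0.0)
               for g in gt_boxes) / len(gt_boxes)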
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2139