Keywords: Image & Video Synthesis, Multi-modal Large Language Models
TL;DR: We fine-tune VLMs to identify key regions in potentially AI-generated images that, upon closer inspection, yield a more grounded, explainable, and accurate classification.
Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose **ZoomIn**, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict.
To support training, we introduce **MagniFake**, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6032