Keywords: Image & Video Synthesis, Multi-modal Large Language Models
TL;DR: We fine-tune VLMs to identify key regions in potentially AI-generated images that, upon closer inspection, yield a more grounded, explainable, and accurate classification.
Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose **ZoomIn**, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict.
To support training, we introduce **MagniFake**, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6032