CLIP-ADA: CLIP-Guided Artifact-Invariant Generalizable Synthetic Image Detection

Jingyi Deng, Chong Zhang, Chenken Xu, Chenhao Lin, Zhengyu Zhao, Shuai Liu, Qian Wang, Chao Shen

Published: 2026, Last Modified: 28 May 2026. IEEE Trans. Inf. Forensics Secur. 2026. CC BY-SA 4.0
Abstract: The rapid advancement of generative models necessitates detection methods that generalize to synthetic images containing diverse generator and semantic artifacts. Recent research has leveraged pre-trained vision-language models, such as CLIP, to extract forensic features that distinguish real from fake images, showing promising performance in synthetic image detection. However, a systematic investigation of the CLIP embedding space to guide its principled use for synthetic image detection is still lacking. This paper addresses this gap by first analyzing the multi-stage CLIP image embedding space to uncover its relationship with cross-artifact forensic patterns. Our findings reveal that the mid-level stages primarily encode forensic and generator artifact features, while the high-level stages primarily encode semantic artifact features. Building upon these insights, we propose the CLIP-guided Dual-level Augmentation and Forensic Distribution Adaptation (CLIP-ADA) framework to perform artifact-invariant generalizable detection. Specifically, dual-level augmentation diversifies fake embeddings and suppresses artifact encoding during training to prevent detectors from relying excessively on artifact features. Moreover, forensic distribution adaptation reformulates synthetic image detection as identifying distributional deviations from the CLIP-encoded real embeddings and designs adapters to extract cross-artifact forensic features in a manner adaptive to the detection scenario. Extensive evaluations in both the conventional single-generator and the continual learning-based multi-generator training settings demonstrate the effectiveness of our method, which surpasses the state-of-the-art methods in both settings by over 6% in average accuracy on unseen data from more than 10 generators.
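To make the idea of "identifying distributional deviations from real embeddings" concrete, the following is a minimal illustrative sketch, not the CLIP-ADA method itself. It assumes a simple per-dimension Gaussian model of real embeddings and scores a probe by its root-mean-square z-score; the paper instead designs learned adapters, which are not reproduced here. The function names `fit_real_stats` and `deviation_score` are hypothetical, and the vectors are random toy data standing in for CLIP image embeddings.

```python
import math
import random


def fit_real_stats(real_embeddings):
    """Per-dimension mean and standard deviation of the real embeddings."""
    n = len(real_embeddings)
    dim = len(real_embeddings[0])
    means = [sum(e[d] for e in real_embeddings) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((e[d] - means[d]) ** 2 for e in real_embeddings) / n
        stds.append(math.sqrt(var) + 1e-8)  # epsilon avoids division by zero
    return means, stds


def deviation_score(embedding, means, stds):
    """Root-mean-square z-score; larger means farther from the real distribution."""
    z2 = [((x - m) / s) ** 2 for x, m, s in zip(embedding, means, stds)]
    return math.sqrt(sum(z2) / len(z2))


# Toy stand-ins for embeddings (assumption: NOT actual CLIP outputs).
rng = random.Random(0)
DIM = 16
real = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(500)]
means, stds = fit_real_stats(real)

# One probe drawn like the real data, one drawn from a shifted distribution
# that plays the role of a hypothetical "fake" embedding.
real_probe = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
shifted_probe = [rng.gauss(3.0, 1.0) for _ in range(DIM)]

real_score = deviation_score(real_probe, means, stds)
shifted_score = deviation_score(shifted_probe, means, stds)
print(f"real probe: {real_score:.2f}, shifted probe: {shifted_score:.2f}")
```

In this toy setup the shifted probe receives the higher score, which is the behavior a deviation-based detector relies on: only real data is modeled, so a generator never seen during fitting can still be flagged if its embeddings fall away from the real distribution.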