Abstract: Recent advances in generative artificial intelligence (AI) have enabled the synthesis of high-quality images with cross-modal controllability, e.g., text-to-image generation. Vision-language models (VLMs) play an indispensable role in such cross-modal generation tasks by aligning the representations of concepts across modalities, which inspires us to detect VLM-generated images through the traces and fingerprints that VLMs leave behind. In this paper, we therefore propose a one-class classification (OCC) framework, named Outliers are Real (OaReal), to recognize VLM-generated images. Given the rarity of VLMs due to their high training cost, we regard VLM-generated images as normal samples and model their distribution in the CLIP latent space. During the testing phase, samples far from the learned distribution (outliers) are classified as real images, while samples within it are classified as VLM-generated. Compared with binary classifiers, the OCC design of OaReal relieves training of the burden of learning complex patterns from diverse real and fake images. Furthermore, we propose a hard-sample-aware contrastive loss (Harsacol) that takes edited samples into account and improves the inclusiveness of the learned space for both VLM-synthesized and VLM-edited samples. We conducted comprehensive experiments to evaluate OaReal; the results demonstrate the superiority of our method in terms of both effectiveness and efficiency.