A Strong Diffusion-generated Image Detector via Cross-modal Representation Learning with Neighboring Pixel Relationships

20 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion-generated Image Detection
Abstract: The astonishing proficiency and unprecedented realism of diffusion models in creating and manipulating images have raised serious concerns. Many methods have been proposed to detect generated images, and most take the RGB modality as input. Recently, the concept of Neighboring Pixel Relationships (NPR) was proposed to capture and characterize the generalized structural artifacts stemming from the up-sampling operations that usually exist in the generation process. A classifier taking only the NPR modality as input already achieves good generalization in detecting generated images. Intriguingly, little work has investigated combining the RGB and NPR modalities. To this end, this paper leverages features from both the RGB and NPR modalities to detect generated images. Specifically, we propose a Strong Diffusion-generated Image Detector (SDID) that takes advantage of two different but complementary representation learning methods: Cross-Modal Contrastive Learning (CMCL) and Cross-Modal Mutual Distillation (CMMD). CMCL boosts the discrimination between features of real and fake images, while CMMD simultaneously transfers the learned knowledge between the two modalities. CMCL and CMMD work collaboratively so that each modality learns a more comprehensive representation for distinguishing real from fake images. Extensive experiments on the GenImage, DRCT-2M, and Co-Spy-Bench datasets show that the proposed SDID achieves state-of-the-art results without bells and whistles. Code will be open-sourced upon acceptance.
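For concreteness, below is a minimal sketch of how an NPR map can be derived from an RGB image, following the down/up-sampling residual formulation popularized by the original NPR work. The function name `compute_npr` and the sampling factor of 2 are illustrative assumptions; the abstract does not specify which NPR variant SDID uses.

```python
import torch
import torch.nn.functional as F

def compute_npr(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Illustrative NPR computation (assumed variant, not necessarily SDID's).

    x: RGB batch of shape (B, C, H, W), with H and W divisible by `factor`.

    Down-sampling and then nearest-neighbor up-sampling removes within-patch
    detail; subtracting the result from the original image leaves only the
    differences between neighboring pixels, i.e. the local structural
    artifacts that up-sampling layers in generative networks tend to imprint.
    """
    ref = F.interpolate(x, scale_factor=1.0 / factor, mode="nearest")
    ref_up = F.interpolate(ref, scale_factor=float(factor), mode="nearest")
    return x - ref_up
```

The resulting NPR map has the same shape as the input RGB image, so it can be fed to a second encoder branch in parallel with the RGB branch, which is the setting the cross-modal objectives (CMCL and CMMD) described above operate in.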
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24678