Detecting Synthetic Images by Cross-Modal Commonality Interaction
Abstract: Existing synthetic image detection approaches can be categorized into three paradigms: spatial, frequency, and fingerprint-based methods. Our analysis reveals a fundamental commonality across these paradigms: a significant reliance on high-frequency image components. This observation highlights the discriminative power of high-frequency information for this task and provides a strong rationale for learning generalized artifact representations based on multi-modal fusion strategies. Building on this insight, we introduce a multi-modal high-frequency interactive detection framework for general synthetic image detection. This framework explicitly integrates high-frequency information from both the spatial and frequency domains. Specifically, its spatial processing branch incorporates a novel high-frequency self-enhancement module to bolster local high-frequency representations. Concurrently, the frequency processing branch utilizes a multi-scale frequency information enhancement module to capture diverse contextual cues. At the feature fusion stage, we propose a pooling-guided cross-modal high-frequency interaction module, which dynamically weights cross-modal information to further reinforce salient high-frequency representations. Extensive experiments on public datasets demonstrate that our proposed framework achieves state-of-the-art performance in real-world detection scenarios.
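To make the pipeline described above concrete, the sketch below illustrates the two ingredients the abstract names: extracting high-frequency components in the spatial domain (here, a residual against a local box blur) and in the frequency domain (an FFT high-pass), then blending the two branches with a pooling-guided weighting. This is a minimal NumPy illustration of the general idea, not the paper's actual modules; the specific filters, cutoff, and softmax gating are assumptions chosen for clarity.

```python
import numpy as np

def spatial_highpass(img, k=3):
    """Spatial high-frequency residual: image minus a k-by-k box blur."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    h, w = img.shape
    blur = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            blur[i, j] = padded[i:i + k, j:j + k].mean()
    return img - blur

def frequency_highpass(img, cutoff=0.25):
    """FFT high-pass: zero out low-frequency coefficients near the spectrum center."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff / 2), int(w * cutoff / 2)
    f[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

def pooled_gate_fusion(a, b):
    """Pooling-guided weighting (illustrative): global-average-pool each
    branch's high-frequency energy, softmax the pooled scores, and blend."""
    scores = np.array([np.mean(np.abs(a)), np.mean(np.abs(b))])
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[0] * a + weights[1] * b

# Usage: fuse the two high-frequency views of a grayscale image.
img = np.random.rand(16, 16)
fused = pooled_gate_fusion(spatial_highpass(img), frequency_highpass(img))
```

In the paper's framework these steps would be learned modules operating on deep feature maps rather than fixed filters on raw pixels; the sketch only shows why both domains can surface complementary high-frequency cues for a detector to weigh.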