Detecting Synthetic Images by Cross-Modal Commonality Interaction
Abstract: Existing synthetic image detection approaches can be categorized into three paradigms: spatial, frequency, and fingerprint-based methods. Our analysis reveals a fundamental commonality across these paradigms: a significant reliance on high-frequency image components. This observation highlights the discriminative power of high-frequency information for this task and provides a strong rationale for learning generalized artifact representations based on multi-modal fusion strategies. Building on this insight, we introduce a multi-modal high-frequency interactive detection framework for general synthetic image detection. This framework explicitly integrates high-frequency information from both the spatial and frequency domains. Specifically, its spatial processing branch incorporates a novel high-frequency self-enhancement module to bolster local high-frequency representations. Concurrently, the frequency processing branch utilizes a multi-scale frequency information enhancement module to capture diverse contextual cues. At the feature fusion stage, we propose a pooling-guided cross-modal high-frequency interaction module, which dynamically weights cross-modal information to further reinforce salient high-frequency representations. Extensive experiments on public datasets demonstrate that our proposed framework achieves state-of-the-art performance in real-world detection scenarios.
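To make the pipeline described above concrete, the sketch below illustrates the two ingredients the abstract names: extracting high-frequency components in the spatial domain (here, a residual against a local box blur) and in the frequency domain (an FFT high-pass), then blending the two branches with a pooling-guided weighting. This is a minimal NumPy illustration of the general idea, not the paper's actual modules; the specific filters, cutoff, and softmax gating are assumptions chosen for clarity.

```python
import numpy as np

def spatial_highpass(img, k=3):
    """Spatial high-frequency residual: image minus a k-by-k box blur."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    h, w = img.shape
    blur = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            blur[i, j] = padded[i:i + k, j:j + k].mean()
    return img - blur

def frequency_highpass(img, cutoff=0.25):
    """FFT high-pass: zero out low-frequency coefficients near the spectrum center."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff / 2), int(w * cutoff / 2)
    f[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

def pooled_gate_fusion(a, b):
    """Pooling-guided weighting (illustrative): global-average-pool each
    branch's high-frequency energy, softmax the pooled scores, and blend."""
    scores = np.array([np.mean(np.abs(a)), np.mean(np.abs(b))])
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[0] * a + weights[1] * b

# Usage: fuse the two high-frequency views of a grayscale image.
img = np.random.rand(16, 16)
fused = pooled_gate_fusion(spatial_highpass(img), frequency_highpass(img))
```

In the paper's framework these steps would be learned modules operating on deep feature maps rather than fixed filters on raw pixels; the sketch only shows why both domains can surface complementary high-frequency cues for a detector to weigh.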