Keywords: Image Manipulation Localization, Vision-Language Models, Joint Information Supervision, Weight-Aware Decoder
TL;DR: We propose VLWA-Net, a novel VLM-based framework for image manipulation localization.
Abstract: Image Manipulation Localization (IML) aims to identify and pinpoint regions within an image that have been forged or manipulated. Although progress has been made on IML, existing techniques still face several challenges. First, tampering techniques are diverse and complex and leave varied tampering artifacts in images; to effectively identify different types of tampered images, a model must extract comprehensive and highly discriminative tampering features. Second, some IML frameworks fuse features from different scales with identical weights during decoding, ignoring that different scales contribute unevenly to the prediction. To address these challenges, we propose VLWA-Net, a novel framework based on Vision-Language Models (VLMs). The framework leverages a VLMs-enhanced Artifact Extractor and a Multi-Domain Artifact Modulator to capture rich and discriminative tampering features, which are combined with traditional noise features as auxiliary cues. We then introduce a Weight-Aware Decoder (WAD) that accounts for sensitivity differences both across scales and among feature points within the same scale. Additionally, the overall framework is trained with a Joint Information Supervision strategy, which enhances the model's ability to perceive fine details of tampered regions. Experimental results demonstrate that the proposed framework significantly improves localization accuracy on multiple mainstream benchmarks and exhibits strong robustness and generalization.
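For intuition, the sketch below shows one way weight-aware multi-scale fusion could be realized; the module name `WeightAwareFusion`, the softmax scalar weight per scale, and the per-pixel sigmoid weight maps are illustrative assumptions, not the paper's actual WAD design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightAwareFusion(nn.Module):
    """Hypothetical sketch of weight-aware multi-scale fusion.

    Learns a scalar weight per scale (inter-scale sensitivity) and a
    per-pixel weight map per scale (point-wise sensitivity), then sums the
    re-weighted, upsampled features. Illustration only, not the paper's WAD.
    """

    def __init__(self, in_channels, num_scales):
        super().__init__()
        # One learnable scalar per scale, normalized by softmax in forward().
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))
        # One 1x1 conv per scale producing a per-pixel weight map.
        self.point_weight = nn.ModuleList(
            [nn.Conv2d(in_channels, 1, kernel_size=1) for _ in range(num_scales)]
        )

    def forward(self, feats):
        # feats: list of tensors [B, C, H_i, W_i], finest scale first.
        target_size = feats[0].shape[-2:]
        scale_w = torch.softmax(self.scale_logits, dim=0)
        fused = 0
        for i, f in enumerate(feats):
            # Upsample every scale to the finest resolution.
            f = F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            # Per-pixel weight in (0, 1) for this scale.
            point_w = torch.sigmoid(self.point_weight[i](f))
            fused = fused + scale_w[i] * point_w * f
        return fused


if __name__ == "__main__":
    fusion = WeightAwareFusion(in_channels=64, num_scales=3)
    feats = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
    print(fusion(feats).shape)  # torch.Size([2, 64, 64, 64])
```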
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11733