Keywords: Image Manipulation Localization, Vision-Language Models, Joint Information Supervision, Weight-Aware Decoder
TL;DR: We propose VLWA-Net, a novel VLM-based framework for image manipulation localization.
Abstract: Image Manipulation Localization (IML) aims to identify and pinpoint regions within an image that have been forged or manipulated. Although progress has been made on IML, existing techniques still face several challenges. First, tampering techniques are diverse and complex and leave varied tampering artifacts in images; to effectively identify different types of tampered images, a model must extract comprehensive and highly discriminative tampering features. Second, some IML frameworks fuse features from different scales with identical weights during decoding, ignoring that different scales contribute unevenly to the prediction. To address these challenges, we propose VLWA-Net, a novel framework based on Vision-Language Models (VLMs). The framework leverages a VLMs-enhanced Artifact Extractor and a Multi-Domain Artifact Modulator to capture rich and discriminative tampering features, which are combined with traditional noise features as auxiliary cues. We then introduce a Weight-Aware Decoder (WAD) that accounts for sensitivity differences both across scales and among feature points within the same scale. Additionally, the overall framework is trained with a Joint Information Supervision strategy, which enhances the model's ability to perceive fine details of tampered regions. Experimental results demonstrate that the proposed framework significantly improves localization accuracy on multiple mainstream benchmarks and exhibits strong robustness and generalization.
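For intuition, the sketch below shows one way weight-aware multi-scale fusion could be realized; the module name `WeightAwareFusion`, the softmax scalar weight per scale, and the per-pixel sigmoid weight maps are illustrative assumptions, not the paper's actual WAD design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightAwareFusion(nn.Module):
    """Hypothetical sketch of weight-aware multi-scale fusion.

    Learns a scalar weight per scale (inter-scale sensitivity) and a
    per-pixel weight map per scale (point-wise sensitivity), then sums the
    re-weighted, upsampled features. Illustration only, not the paper's WAD.
    """

    def __init__(self, in_channels, num_scales):
        super().__init__()
        # One learnable scalar per scale, normalized by softmax in forward().
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))
        # One 1x1 conv per scale producing a per-pixel weight map.
        self.point_weight = nn.ModuleList(
            [nn.Conv2d(in_channels, 1, kernel_size=1) for _ in range(num_scales)]
        )

    def forward(self, feats):
        # feats: list of tensors [B, C, H_i, W_i], finest scale first.
        target_size = feats[0].shape[-2:]
        scale_w = torch.softmax(self.scale_logits, dim=0)
        fused = 0
        for i, f in enumerate(feats):
            # Upsample every scale to the finest resolution.
            f = F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            # Per-pixel weight in (0, 1) for this scale.
            point_w = torch.sigmoid(self.point_weight[i](f))
            fused = fused + scale_w[i] * point_w * f
        return fused


if __name__ == "__main__":
    fusion = WeightAwareFusion(in_channels=64, num_scales=3)
    feats = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
    print(fusion(feats).shape)  # torch.Size([2, 64, 64, 64])
```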
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11733