Multi-view Feature Extraction via Tunable Prompts is Enough for Image Manipulation Localization

Xuntao Liu; Yuzhou Yang; Haoyue Wang; Qichao Ying; Zhenxing Qian; Xinpeng Zhang; Sheng Li

Published: 20 Jul 2024, Last Modified: 05 Aug 2024. Venue: MM2024 Poster. License: CC BY 4.0

Abstract: Deceptive images can spread quickly via social networking services, posing significant risks. Research on Image Manipulation Localization (IML) has progressed rapidly in response. However, the scarcity of public training datasets for the IML task directly hampers model performance. To address this challenge, we propose Prompt-IML, a framework that leverages the rich prior knowledge of pre-trained models by employing tunable prompts. Specifically, sets of tunable prompts enable the frozen pre-trained model to extract multi-view features, including spatial and high-frequency features. This approach minimizes redundant architecture for feature extraction across different views, resulting in reduced training costs. In addition, we develop a plug-and-play Feature Alignment and Fusion module that integrates seamlessly into pre-trained models without additional structural modifications. The proposed module reduces noise and uncertainty in features through interactive processing. Experimental results show that our method attains superior performance across 6 test datasets and demonstrates exceptional robustness.
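
The central idea of the abstract, one frozen backbone reused for several views with only small prompt sets being tunable, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention layer, the random "pre-trained" weights, the dimensions, and the names `frozen_encoder`, `extract_view`, and `prompts` are all hypothetical stand-ins, and feeding the same patch tokens to both views is an assumption, since the abstract does not say what input the high-frequency view receives.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, N_PROMPTS = 8, 16, 4  # illustrative sizes, not from the paper

# Frozen "pre-trained" weights (random stand-ins); these are never updated.
FROZEN = {name: rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
          for name in ("q", "k", "v")}


def frozen_encoder(tokens):
    """One self-attention layer that uses only the frozen weights."""
    q, k, v = (tokens @ FROZEN[name] for name in ("q", "k", "v"))
    scores = q @ k.T / np.sqrt(DIM)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v


def extract_view(patch_tokens, view_prompts):
    """Prepend one view's prompts, run the frozen encoder, drop prompt outputs."""
    sequence = np.concatenate([view_prompts, patch_tokens], axis=0)
    return frozen_encoder(sequence)[len(view_prompts):]


# The only tunable parameters: one small prompt set per view (assumed layout).
prompts = {
    "spatial": rng.standard_normal((N_PROMPTS, DIM)),
    "high_freq": rng.standard_normal((N_PROMPTS, DIM)),
}

patch_tokens = rng.standard_normal((N_TOKENS, DIM))  # stand-in for image patches

# The same frozen encoder yields a different feature map for each prompt set.
features = {view: extract_view(patch_tokens, p) for view, p in prompts.items()}

n_tunable = sum(p.size for p in prompts.values())
n_frozen = sum(w.size for w in FROZEN.values())
print(f"tunable parameters: {n_tunable}, frozen parameters: {n_frozen}")
```

In a real training loop only the arrays in `prompts` would receive gradient updates, while `FROZEN` stays fixed, which is where the reduced training cost would come from. No duplicate backbone is needed per view, because each view differs only in the prompts prepended to the token sequence.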

Primary Subject Area: [Experience] Multimedia Applications

Secondary Subject Area: [Experience] Multimedia Applications

Relevance To Conference: We explore the potential of existing pre-trained models to address the scarcity of publicly available datasets in the IML task, a problem directly relevant to multimedia. We propose Prompt-IML, which uses a single pre-trained network to extract multi-view features through prompt tuning. A specially designed Feature Alignment and Fusion (FAF) module integrates the multi-view features, reducing noise and uncertainty in the features and suppressing sporadic positive responses. We hope that our work can reduce the presence of fake images, which would also benefit the development of multimedia.

Supplementary Material: zip

Submission Number: 302
