Concentrated Reasoning and Unified Reconstruction for Multi-Modal Media Manipulation

Weichen Zhao, Yuxing Lu, Ge Jiao, Yuan Yang

Published: 01 Jan 2024, Last Modified: 23 Jul 2025, ICASSP 2024, CC BY-SA 4.0

Abstract: Detecting and Grounding Multi-Modal Media Manipulation (DGM⁴) is an emerging task that aims to identify and locate manipulated elements in both textual and visual media. Given the complexity of this task, a model requires more sophisticated reasoning capabilities to align multi-modal features and capture forgery traces. To this end, we propose a Concentrated reasoning and Unified reconstruction framework (CrUr) for DGM⁴. Instead of adhering to traditional hierarchical reasoning paradigms, we directly carry out all inference tasks using integrated multi-modal features. Specifically, we extract and align features at a finer granularity, leveraging advanced mask signal modeling to capture subtle differences that may indicate manipulation. Moreover, to adapt to fine-grained reasoning tasks, we design a transformer-based Reconstruction Harmonizer that facilitates more complex interactions among the reconstructed features and ultimately yields the integrated features. Experimental results on the DGM⁴ datasets show that our method achieves state-of-the-art performance.
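
The abstract does not specify how the mask signal modeling is implemented. As a rough illustration of the general idea only, the sketch below hides a random subset of feature tokens and scores a reconstruction on the hidden positions alone. The function names (`mask_tokens`, `masked_reconstruction_loss`), the constant mask value, the 50% mask ratio, and the mean-squared-error objective are all assumptions made for this example, not details taken from CrUr.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_value=0.0, seed=0):
    """Replace a random subset of token vectors with a constant mask vector.

    Hypothetical helper: the masking strategy used by CrUr is not stated
    in the abstract.
    """
    rng = random.Random(seed)
    n_masked = int(round(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked_set = set(masked_idx)
    corrupted = [
        [mask_value] * len(tok) if i in masked_set else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return corrupted, masked_idx


def masked_reconstruction_loss(original, reconstructed, masked_idx):
    """Mean squared error computed only over the masked positions (assumed objective)."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for a, b in zip(original[i], reconstructed[i]):
            total += (a - b) ** 2
            count += 1
    return total / count


# Four toy feature tokens of dimension 2 (stand-ins for image patch or word features).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
corrupted, masked_idx = mask_tokens(tokens, mask_ratio=0.5)

# A perfect reconstructor returns the originals, so the masked loss is zero.
perfect_loss = masked_reconstruction_loss(tokens, tokens, masked_idx)

# Passing the corrupted input through unchanged leaves the masked error in place.
identity_loss = masked_reconstruction_loss(tokens, corrupted, masked_idx)
```

In a real model the reconstruction would come from a learned network, and the loss would drive that network to recover the hidden features from their multi-modal context; here the two toy "reconstructors" only show that the loss ignores unmasked positions and penalizes unrecovered ones.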