Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically coordinated manipulations, in which visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generating contextually plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG first harnesses external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared with state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.