Keywords: Image Manipulation Detection & Localization, Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) offer powerful reasoning for localizing tampering in images, yet existing MLLM-based approaches suffer from suboptimal localization due to their reliance on exogenous segmentation decoders.
Such a stitched pipeline introduces an information bottleneck during backpropagation, diluting the spatial signals in the MLLM's hidden embeddings, and lacks semantic priors for forensic tasks, which leads to imprecise masks and poor generalization in Image Manipulation Detection & Localization (IMDL).
To address these limitations, we propose TamperTok, which reformulates MLLM-based IMDL as an autoregressive sequence generation task.
Unlike existing approaches that rely on an exogenous decoder for localization, TamperTok directly generates spatially grounded token sequences from the MLLM, enabling precise probabilistic mask prediction without intermediate supervision.
Specifically, we introduce a Kernel Splatting Decoder (KSD) that maps tokens to binary masks while mitigating the sharp gradients caused by the deterministic lookup in codebook-based detokenizers via clustering-aware code smoothing (see the first sketch after the abstract).
In addition, to compensate for the lack of priors over diverse tampering types, e.g., splicing and semantic forgeries, we propose a novel Scene-wise Expert Injection (SwEI) that selects and injects multi-scale, tampering-specific features from a forensic expert model into the MLLM (see the second sketch below).
Extensive experiments show that TamperTok achieves state-of-the-art (SOTA) performance on multiple tampering localization datasets, with 20% improvements in IoU and F1 over existing MLLM-based models, while exhibiting stronger robustness to noise perturbations and cross-domain shifts.
Code will be released.
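To make the KSD idea concrete, below is a minimal PyTorch sketch of kernel-based code smoothing in a codebook detokenizer. The class name `KernelSplattingDecoder`, the Gaussian-softmax kernel, and all dimensions are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of kernel splatting over a learned mask codebook;
# names, the Gaussian kernel, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelSplattingDecoder(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64, tau: float = 0.1):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # mask-token codebook
        self.tau = tau  # kernel bandwidth: smaller tau -> sharper assignments

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, N, code_dim) hidden states for N spatial tokens.
        # A hard argmin lookup passes gradient to only one code per token;
        # splatting each token over nearby codes with a Gaussian kernel
        # yields a smooth, fully differentiable assignment instead.
        codes = self.codebook.weight.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        dists = torch.cdist(token_embeds, codes)           # (B, N, K) distances
        weights = F.softmax(-dists / self.tau, dim=-1)     # soft code assignment
        smoothed = weights @ self.codebook.weight          # (B, N, code_dim)
        return smoothed  # a light head would map these codes to mask logits
```

The bandwidth `tau` trades off assignment sharpness against gradient smoothness, which is the failure mode the abstract attributes to deterministic codebook lookup.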
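Similarly, the SwEI description suggests a scene-conditioned gate over multi-scale expert features. The sketch below is a hedged guess at such a mechanism; `SceneWiseExpertInjection`, the softmax gate, and the residual injection are all assumed for illustration rather than taken from the paper.

```python
# A hedged sketch of scene-conditioned expert injection, assuming the
# forensic expert exposes several spatially aligned feature scales.
import torch
import torch.nn as nn

class SceneWiseExpertInjection(nn.Module):
    def __init__(self, expert_dims: list[int], llm_dim: int):
        super().__init__()
        # One projection per expert scale into the MLLM token width.
        self.projs = nn.ModuleList(nn.Linear(d, llm_dim) for d in expert_dims)
        # Scene-conditioned gate weighting each scale's contribution.
        self.gate = nn.Linear(llm_dim, len(expert_dims))

    def forward(self, scene_embed, expert_feats, visual_tokens):
        # scene_embed:   (B, llm_dim) global scene descriptor from the MLLM.
        # expert_feats:  list of (B, N, d_i) multi-scale expert features,
        #                spatially aligned with the N visual tokens.
        # visual_tokens: (B, N, llm_dim) visual tokens entering the MLLM.
        gates = torch.softmax(self.gate(scene_embed), dim=-1)  # (B, num_scales)
        injected = sum(
            gates[:, i, None, None] * proj(feat)
            for i, (proj, feat) in enumerate(zip(self.projs, expert_feats))
        )
        return visual_tokens + injected  # residual injection into token stream
```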
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11262