Learning Fine-Grained and Semantically Aware Mamba Representations for Tampered Text Detection in Images

Published: 01 Jan 2024 · Last Modified: 17 Apr 2025 · PRCV (7) 2024 · License: CC BY-SA 4.0
Abstract: Tampered text detection in images, the task of detecting manipulated or forged text within image documents or signage, has attracted increasing attention due to the widespread use of image editing software and CNN-based synthesis techniques. The core difficulty in perceiving the subtle differences in tampered text images lies in the gap between a model's ability to capture global, fine-grained information and what the task realistically demands. In this work, we propose a robust detection method, Tampered Text Detection with Mamba (TTDMamba). It achieves linear complexity without sacrificing global spatial context, offering significant advantages over the limitations of the Transformer architecture. In particular, we adopt the VMamba architecture as the encoder and introduce a High-frequency Feature Aggregation module that enriches the visual features with additional high-frequency signals; this aggregation guides Mamba's attention toward fine-grained forgery cues. Additionally, we integrate Disentangled Semantic Axial Attention into the stacked Visual State Space blocks of the VMamba architecture, which injects the high-level semantic attributes inherent to the tampered image into the pretrained hierarchical encoder. As a result, we obtain a more reliable and accurate tamper map. Extensive experiments on the T-SROIE, T-IC13, and DocTamper datasets demonstrate that TTDMamba not only surpasses existing state-of-the-art methods in detection accuracy but also shows superior robustness in pixel-level forgery localization, marking a significant contribution to text tampering detection.
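The abstract does not specify how the High-frequency Feature Aggregation is implemented. Below is a minimal PyTorch sketch of the general idea it describes: extracting high-frequency residuals with a fixed high-pass filter and fusing them with encoder features so that fine-grained forgery traces are emphasized. The module name, the Laplacian kernel, and all layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HighFrequencyFeatureAggregation(nn.Module):
    """Hypothetical sketch of high-frequency feature aggregation:
    extract high-frequency residuals from the input image with a fixed
    Laplacian-style high-pass filter, embed them, and fuse them with
    the backbone's visual features. Names and design choices are
    illustrative, not taken from the paper."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Fixed 3x3 Laplacian high-pass kernel, applied per RGB channel.
        kernel = torch.tensor([[0., -1., 0.],
                               [-1., 4., -1.],
                               [0., -1., 0.]]).view(1, 1, 3, 3)
        self.register_buffer("hp_kernel", kernel.repeat(3, 1, 1, 1))
        # Project high-frequency residuals into the feature dimension.
        self.embed = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # 1x1 fusion of concatenated visual and high-frequency features.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, image: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); feats: (B, C, h, w) from the encoder.
        high_freq = F.conv2d(image, self.hp_kernel, padding=1, groups=3)
        hf_feats = self.embed(high_freq)
        # Match the spatial resolution of the encoder features.
        hf_feats = F.interpolate(hf_feats, size=feats.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([feats, hf_feats], dim=1))
```

In this reading, the high-frequency branch supplies signals that image editing tends to disturb (sharp stroke boundaries, compression residue), and the fusion step lets the state-space encoder attend to them alongside ordinary visual features.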