Abstract: This paper proposes an end-to-end two-branch network for document image perceptual hashing, in which the two branches capture image visual features and text features, respectively. Existing perceptual hashing schemes struggle to detect tampering confined to a tiny portion of the text, while plain text detection cannot handle perceptual matching of background regions. To address these issues, we extract text via optical character recognition (OCR) and then generate text features using bidirectional encoder representations from Transformers (BERT). Visual features are extracted from the local and global characteristics of the image by a ResNet and Vision Transformer cascade, and the two feature streams are then fused through a fully connected layer to generate the final hash sequence. The proposed network considers both image visual features and textual information to verify that a document image has not been tampered with. In our network, the OCR module enables accurate and intelligent text detection and recognition, particularly for tampering that affects only a tiny portion of the text, and provides efficient and robust text recognition. Experimental results show that the proposed hashing scheme is both robust and discriminative on document images.
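The fusion-and-binarization step described above can be sketched in miniature. This is a hedged illustration only: fixed random projections stand in for the paper's ResNet + Vision Transformer visual branch and OCR + BERT text branch, and all dimensions, weights, and function names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, visual features come from a
# ResNet + Vision Transformer cascade and text features from OCR + BERT.
# Here, precomputed feature vectors and one fixed random projection
# illustrate only the fusion -> hash flow.
VIS_DIM, TXT_DIM, HASH_BITS = 512, 768, 64
W_fuse = rng.standard_normal((VIS_DIM + TXT_DIM, HASH_BITS))

def perceptual_hash(visual_feat, text_feat):
    """Concatenate the two branches, project through a 'fully connected'
    layer, and binarize by sign to obtain the hash sequence."""
    fused = np.concatenate([visual_feat, text_feat]) @ W_fuse
    return (fused > 0).astype(np.uint8)

def hamming(h1, h2):
    """Bit differences between two hashes; small distance = perceptual match."""
    return int(np.sum(h1 != h2))

vis = rng.standard_normal(VIS_DIM)
txt = rng.standard_normal(TXT_DIM)
h_orig = perceptual_hash(vis, txt)
h_same = perceptual_hash(vis, txt)                                   # identical input
h_edit = perceptual_hash(vis, txt + 0.5 * rng.standard_normal(TXT_DIM))  # perturbed text

print(hamming(h_orig, h_same))  # 0: unmodified document matches exactly
print(hamming(h_orig, h_edit))  # typically > 0: text tampering flips hash bits
```

In a verification setting, the hash of a received document image would be compared to the stored hash, with a Hamming-distance threshold separating benign distortions from tampering.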
External IDs: dblp:journals/tbd/SituLYQZ25