VLMAWR: A Method for Manchu Archives Word Recognition Based on Vision-Language Model

Published: 01 Jan 2025 · Last Modified: 16 Nov 2025 · ICDAR (3) 2025 · CC BY-SA 4.0
Abstract: The Manchu archives are historical records produced during China's Qing Dynasty and hold significant historical value, and word recognition is crucial for their preservation and study. However, long-term storage has degraded the word images through ink infiltration, spreading, discoloration, and staining, which makes it difficult for traditional vision-only word recognition methods to perform well. To address this problem, this paper builds on MATRN, a vision-language model proposed for scene text recognition, and mitigates the impact of image degradation by incorporating the visual masking mechanism of VisionLAN and the language masking mechanism of ABINet, yielding a vision-language method for Manchu archives word recognition. The backbone network combines a ResNet architecture with attention mechanisms for feature extraction. In the visual module, the masked language-aware module (MLM) generates character-level masked features, and the visual reasoning module (VRM) then infers the masked characters from the surrounding visual context, strengthening the model's contextual reasoning ability. The language module is trained with a masking objective through the bidirectional cloze network (BCN), which lets each character attend to its left and right context simultaneously, improving the model's error-correction capability. Experimental results show that the proposed method outperforms traditional methods, achieving 98.68% accuracy on a real Manchu archives dataset.
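The abstract describes the pipeline only in prose, so the following minimal PyTorch sketch makes the data flow concrete: a ResNet backbone, a visual branch with VisionLAN-style masking (MLM producing a character-level mask, VRM reasoning over the occluded features), and an ABINet-style BCN language branch that refines the visual predictions. Everything here is an illustrative assumption; the module names follow VisionLAN/ABINet terminology, but the ResNet-18 choice, layer sizes, and wiring are placeholders rather than the authors' implementation.

```python
# Minimal sketch of a VLMAWR-style vision-language recognition pipeline.
# All internals are simplified placeholders, not the paper's actual code.
import torch
import torch.nn as nn
import torchvision.models as tv


class VisualModule(nn.Module):
    """ResNet backbone + attention, with VisionLAN-style masking (MLM/VRM)."""

    def __init__(self, d_model=512, num_classes=100):
        super().__init__()
        resnet = tv.resnet18(weights=None)
        # Drop the avgpool/fc head to keep a spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(512, d_model)
        # MLM (assumed form): predicts a soft character-level occlusion map.
        self.mlm = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        # VRM (assumed form): attention layers that reason over masked context.
        self.vrm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, images):
        f = self.backbone(images)            # (B, 512, H', W')
        f = f.flatten(2).transpose(1, 2)     # (B, H'*W', 512) token sequence
        f = self.proj(f)
        mask = self.mlm(f)                   # character-level mask map
        f_masked = f * (1.0 - mask)          # occlude the selected features
        v = self.vrm(f_masked)               # infer occluded chars from context
        return self.cls(v)                   # per-position character logits


class LanguageModule(nn.Module):
    """ABINet-style bidirectional cloze network (BCN), simplified: in the
    real BCN each position is blocked from attending to itself."""

    def __init__(self, d_model=512, num_classes=100):
        super().__init__()
        self.embed = nn.Linear(num_classes, d_model)
        self.bcn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, char_probs):
        h = self.bcn(self.embed(char_probs))  # attend to left and right context
        return self.cls(h)                    # refined (error-corrected) logits


if __name__ == "__main__":
    vis = VisualModule()
    lang = LanguageModule()
    images = torch.randn(2, 3, 32, 128)       # batch of degraded word images
    v_logits = vis(images)                    # visual-branch predictions
    refined = lang(v_logits.softmax(-1))      # language branch corrects them
    print(refined.shape)                      # (2, seq_len, num_classes)
```

In this reading, the visual branch learns to recognize characters even when their features are deliberately occluded, which is how the method compensates for ink degradation, while the cloze-style language branch fixes residual visual errors from bidirectional context.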