Abstract: Ancient Mongolian documents are valuable repositories of historical and cultural information. Analyzing them effectively requires dedicated recognition research, and because some of their words are absent from current lexicons, recognizing out-of-vocabulary (OOV) words is crucial. This paper proposes the Ancient Mongolian Documents Recognition Unit (AMDRU), an end-to-end recognition approach based on multi-feature fusion. The approach improves the model's understanding of images in ancient documents by leveraging information from word images at different scales. AMDRU takes word images as input and passes them through a custom-designed feature extractor that captures multi-scale structural details. These features are fed into an encoder that uses the efficient additive attention mechanism to better understand and represent the essential information. The encoded features are then passed to a Transformer decoder, which converts them into text and outputs the predicted character string. To address the uneven data distribution in ancient documents and strengthen the learning of rare word images, the asymmetric loss is used, which significantly improves the model's ability to learn from word images and boosts recognition performance. Experimental results demonstrate that the proposed approach captures the structural features of characters in ancient Mongolian documents more accurately and outperforms existing methods in recognition performance. The improvement is particularly pronounced on the challenging task of recognizing OOV words.
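The abstract does not say which form of the asymmetric loss is used or how it is attached to the decoder outputs. As a rough illustration only, the sketch below assumes the widely used formulation for independent binary predictions, in which negatives are down-weighted by a focusing exponent and a probability margin so that easy negatives contribute nothing. The function name `asymmetric_loss` and the parameters `gamma_pos`, `gamma_neg`, and `margin` (with their default values) are assumptions for illustration, not names or settings taken from the paper.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Mean asymmetric loss over independent binary predictions.

    probs   -- predicted probabilities in [0, 1]
    targets -- matching labels, each 0 or 1
    All hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style weight (1 - p)^gamma_pos on -log(p).
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin and
            # clip at zero, so very easy negatives yield zero loss.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total / len(probs)


if __name__ == "__main__":
    # A confidently wrong negative is penalized far more than an easy one.
    print(asymmetric_loss([0.9], [0]), asymmetric_loss([0.03], [0]))
```

With `gamma_neg` larger than `gamma_pos`, the abundant easy negatives are suppressed more strongly than the positives, which is the usual motivation for applying this kind of loss to imbalanced data such as rare word images.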