The image and ground truth dataset of Mongolian movable-type newspapers for text recognition

Min Lu, Feilong Bao, Hui Zhang, Guanglai Gao

Published: 01 Jan 2024, Last Modified: 15 May 2025. Int. J. Document Anal. Recognit. 2024. CC BY-SA 4.0

Abstract: OCR approaches have advanced considerably in recent years thanks to the resurgence of deep learning. However, to the best of our knowledge, there is little work on Mongolian movable-type document recognition. One major hurdle is the lack of a domain-specific, well-labeled dataset for training robust models. This paper aims to create the first Mongolian movable-type text-image dataset for OCR research. We collated 771 paragraph-level pages segmented from 34 newspapers from 1947 to 1952. For each page, word- and line-level text transcriptions and boundary annotations are recorded. The dataset consists of 86,578 word appearances and 9,711 text-line images in total, with a vocabulary size of 7,964. It was built from scratch through image collection, text transcription, text-image alignment, and manual correction. Moreover, an official train/test partition is defined, on which typical text segmentation and recognition experiments are run to establish strong baselines. This dataset is available for research, and we encourage researchers to develop and test new methods using it.