NomNaOCR: The First Dataset for Optical Character Recognition on Han-Nom Script

Hoang-Quan Dang, Duy-Anh Nguyen, Phu-Phuoc Pham, Ngoc-Thinh Nguyen, Tan Chau, Duc-Vu Ngo, Trung-Hieu Nguyen, Chau-Thang Phan, The-Hien Trinh, Minh-Tri Nguyen, Trong-Hop Do

Published: 2022, Last Modified: 06 Jun 2023, RIVF 2022, Readers: Everyone

Abstract: In this article, we introduce NomNaOCR, a dataset for the old Hán-Nôm script built from three valuable historical works of Vietnam. We collected 2,953 handwritten Pages from the Vietnamese Nôm Preservation Foundation, then analyzed and semi-annotated their bounding boxes to generate an additional 38,318 Patches containing text, along with the corresponding strings in digital form. This makes NomNaOCR currently the largest dataset for Hán-Nôm script in Vietnam, serving two main problems in Optical Character Recognition: Text Detection and Text Recognition. Unlike most previous works, which operate on individual characters, our implementations were all done at the sequence level, which not only saves annotation cost but also retains the context within each sequence. For baseline results, we experimented on the validation set of NomNaOCR. Using the DBNet model for Text Detection, we reached an F1-score of up to 99.65%. For Text Recognition, we used the CRNN model and achieved an accuracy of 29.41% at the sequence level and 84.73% at the character level.
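The gap between the sequence-level and character-level recognition accuracies is easier to read with the two metrics side by side: a sequence only counts as correct when every character matches, so a single wrong character fails the whole line. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes character-level accuracy is defined as one minus the normalized edit distance over all ground-truth characters, which is a common convention but is not stated in the abstract. The function names and the sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match their target exactly."""
    exact = sum(p == t for p, t in zip(preds, targets))
    return exact / len(targets)


def character_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Assumed definition: 1 - (total edit distance / total target characters)."""
    total_chars = sum(len(t) for t in targets)
    total_errors = sum(edit_distance(p, t) for p, t in zip(preds, targets))
    return max(0.0, 1.0 - total_errors / total_chars)


if __name__ == "__main__":
    # Placeholder Latin strings stand in for Hán-Nôm text lines.
    targets = ["abcd", "efgh", "ijkl"]
    preds = ["abcd", "efgx", "ijk"]
    print("sequence accuracy:", sequence_accuracy(preds, targets))
    print("character accuracy:", character_accuracy(preds, targets))
```

In this toy example two of the three predictions each contain one character error, so most characters are right while most sequences are wrong, which mirrors the pattern in the reported figures.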
