HHD-Ethiopic: A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-level Performance

29 May 2023 (modified: 12 Dec 2023). Submitted to NeurIPS 2023 Datasets and Benchmarks.
Keywords: HHD-Ethiopic dataset, Historical Ethiopic script, Human-level recognition performance, Character error rate, Low-resource script recognition
TL;DR: This paper introduces HHD-Ethiopic, a new dataset for historical handwritten text-image recognition, characterized by a unique syllabic writing system, limited resources, and complex orthographic diacritics.
Abstract: This paper introduces HHD-Ethiopic, a new OCR dataset for historical handwritten Ethiopic script, characterized by a unique syllabic writing system, low resource availability, and complex orthographic diacritics. The dataset consists of roughly 80,000 annotated text-line images from 1,700 pages of $18^{th}$- to $20^{th}$-century documents, including a training set with text-line images from the $19^{th}$ to $20^{th}$ century and two test sets. One test set is distributed similarly to the training set and contains nearly 6,000 text-line images; the other contains only images from $18^{th}$-century manuscripts, with around 16,000 images. The former allows us to check baseline performance in the classical independent and identically distributed (IID) setting, while the latter addresses a more realistic setting in which the test set is drawn from a different distribution than the training set (out-of-distribution, or OOD). Multiple annotators labeled all text-line images for the HHD-Ethiopic dataset, and an expert supervisor double-checked them. We assessed human-level recognition performance and compared it with state-of-the-art OCR models using the Character Error Rate (CER) metric. Our results show that the model performed comparably to human-level recognition on the $18^{th}$-century test set and outperformed humans on the IID test set. However, the unique challenges posed by the Ethiopic script, such as detecting complex diacritics, still present difficulties for the models. Our baseline evaluation and HHD-Ethiopic dataset will stimulate further research on tailored OCR techniques for the Ethiopic script. The HHD-Ethiopic dataset and the code are publicly available at https://github.com/bdu-birhanu/HHD-Ethiopic
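For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein edit distance between predicted and reference text, divided by the number of reference characters. It is not taken from the authors' code. The function names `edit_distance` and `cer` are hypothetical, and the choice to pool edits and characters over all lines (rather than averaging per-line rates) is an assumption, since the abstract does not state which normalization was used. The sketch treats each Unicode code point as one character, which suits Ethiopic because its syllabic characters are encoded as single code points.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: edits and lengths are pooled over all lines; a per-line
    average would give a different number.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # One substituted character out of three reference characters.
    print(cer(["ሰላም"], ["ሰለም"]))
```

Under this definition a lower value is better, and a value above 1.0 is possible when the prediction is much longer than the reference.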
Supplementary Material: pdf
Submission Number: 286