Abstract: We present a study of byte pair encoding (BPE) based language modeling for open-vocabulary Latin language OCR. On a large-scale handwritten English OCR task, we demonstrate that a simple BPE-based n-gram language model (LM) can deal with the out-of-vocabulary word problem effectively and achieve a better accuracy-footprint tradeoff than a state-of-the-art hybrid word/subword n-gram LM built by interpolating a standard hybrid LM, a word-based LM, and a subword-based LM. On another large-scale printed OCR task covering six Latin languages, namely English, Spanish, French, German, Italian, and Portuguese, we find that a unified OCR system with a single character-based optical model and a single BPE-based n-gram LM shared by all six languages performs better than language-dependent OCR systems. A BPE-based LM thus offers a good product solution for both monolingual and multilingual open-vocabulary Latin language OCR.
External IDs: dblp:conf/icfhr/HuLMQH20