GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Keywords: OCR, low-resource scripts, low-resource languages, minority languages
Abstract: OCR has improved quickly with vision-language models, but evaluation still focuses on a small set of high- and mid-resource scripts. We introduce GlotOCR Bench, a benchmark for OCR generalization across 100+ Unicode scripts, using clean and degraded images rendered from real multilingual text with Google Fonts, HarfBuzz, and FreeType. Evaluating both open and proprietary models, we find that most perform well on fewer than ten scripts, and even the best models generalize to fewer than thirty. OCR performance depends heavily on how well a script is covered in pretraining, as well as on visual recognition; on unfamiliar scripts, models often produce noise or hallucinate lookalike characters. We release the benchmark and rendering pipeline for reproducibility.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 7