WildKhmerST: A Comprehensive Dataset and Benchmark for Khmer Scene Text Detection and Recognition in the Wild

Vannkinh Nom, Saly Keo, Souhail Bakkali, Muhammad Muzzamil Luqman, Mickaël Coustaty, Marçal Rossinyol, Jean-Marc Ogier

Published: 01 Jan 2026, Last Modified: 13 Nov 2025 | Crossref | License: CC BY-SA 4.0
Abstract: This study presents a large-scale dataset of Khmer scene text images captured in real-world environments. Khmer, the official language of Cambodia, is spoken by approximately 17 million people. While Optical Character Recognition (OCR) systems have achieved remarkable success on Latin-script languages such as English, Khmer script poses unique challenges due to its intricate structure, absence of clear word boundaries, and highly diverse character shapes and sizes. A significant limitation in Khmer OCR research has been the scarcity of high-quality training data, particularly for deep learning-based models, which require extensive datasets to achieve robust performance. To address these challenges, we introduce a newly constructed dataset of Khmer scene text, comprising 29,601 annotated text lines from 10,000 unique images. The dataset is highly diverse and challenging, encompassing artistic text, blurred text, low-light conditions, curved text, text on complex backgrounds, and occluded text. Each text line is annotated with polygonal bounding-box coordinates and a line-level transcription, alongside attributes describing background complexity, character appearance, and text style. To establish a foundational benchmark for future research in Khmer OCR, we provide baseline results for Khmer text detection and recognition. Additionally, we propose a robust evaluation metric tailored for Khmer OCR, enabling precise assessment of character error rate (CER) and word error rate (WER) while accounting for the unique characteristics of the Khmer script.
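For readers unfamiliar with the two error rates named in the abstract, the following is a minimal sketch of the conventional definitions: edit distance between reference and hypothesis, divided by the reference length. It is not the tailored metric the paper proposes, whose details the abstract does not give. The function names `edit_distance`, `cer`, and `wer` are hypothetical, CER is computed over NFC-normalized Unicode code points as an assumption, and because Khmer has no explicit word boundaries, `wer` assumes the caller has already segmented each line into word tokens with some external segmenter.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Conventional CER over Unicode code points (assumption: NFC normalization)."""
    ref = list(unicodedata.normalize("NFC", reference))
    hyp = list(unicodedata.normalize("NFC", hypothesis))
    return error_rate(ref, hyp)


def wer(reference_tokens, hypothesis_tokens):
    """Conventional WER over pre-segmented word tokens (segmentation is external)."""
    return error_rate(list(reference_tokens), list(hypothesis_tokens))
```

Counting code points is only one possible unit for CER. Khmer stacks subscript consonants and vowel signs onto a base character, so a metric could equally count whole character clusters, and the choice changes the reported number.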