Abstract: The scarcity of Tibetan handwriting datasets has hindered the application of prevailing artificial intelligence models to the Tibetan language. Because Khyug-yig is the writing style most commonly used in the daily lives of Tibetan people, this study proposes a methodology for constructing a Tibetan handwritten Khyug-yig dataset to support further Tibetan-language research. Textual content from multiple sources, encompassing news, medicine, and Buddhism, was first filtered to establish a corpus of frequent Tibetan words. These words were organized into forms and assigned to 63 Tibetan writers from diverse institutions, including Changdu City's Sixth Senior High School, Tibet University, and a local calligraphy association. The collected handwriting forms were then processed through scanning, cropping, image preprocessing, grouping, and labeling. The result is a Tibetan handwriting dataset of 9,874 unique-word images written in the Khyug-yig style, which addresses the limitations of existing Tibetan handwriting datasets while achieving calligraphic diversity and precise labeling.