KyrgyzNER: The First NER Dataset for the Kyrgyz Language

ACL ARR 2025 February Submission 7011 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. In this work, we describe our annotation scheme, discuss the challenges encountered during the annotation process, and present comprehensive corpus statistics. Our experiments with several NER models, including classical CRF-based approaches and state-of-the-art pretrained multilingual models fine-tuned on our data, demonstrate that while all approaches struggle with underrepresented classes, models such as XLM-RoBERTa achieve a promising balance between precision and recall. These results highlight both the challenges and the potential of leveraging multilingual pretraining for low-resource languages. We note that while XLM-RoBERTa performed best, all multilingual models achieved similar scores, indicating that further investigation and experiments with modified, “more atomic” annotation schemes might provide better insight into model comparison for Kyrgyz language processing.
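The abstract compares models by their balance between precision and recall. As a minimal sketch of how such scores are commonly computed for NER, the Python below extracts entity spans from BIO-tagged sequences and scores exact span matches. The BIO encoding, the function names `bio_to_spans` and `precision_recall_f1`, and the labels `PERSON` and `LOCATION` are assumptions made for illustration; the abstract does not specify the tagging format, the evaluation script, or the names of the 27 classes.

```python
from typing import List, Set, Tuple

# (label, start index, exclusive end index); a hypothetical span representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    label, start = None, 0
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
            label = None
        # Lenient choice: a stray "I-X" with no open entity starts a new one.
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i
    return spans


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores: a prediction counts only on exact label and boundary match."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative tags only, not taken from the dataset.
gold_tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
pred_tags = ["B-PERSON", "I-PERSON", "O", "O"]
p, r, f1 = precision_recall_f1(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the model finds one of two gold entities and predicts nothing spurious, so precision is perfect while recall is halved, which is the kind of trade-off the abstract refers to.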
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, NLP datasets, datasets for low resource languages
Contribution Types: Data resources
Languages Studied: Kyrgyz
Submission Number: 7011