Keywords: less-resourced languages, large language models, benchmarks, Kyrgyz language, evaluation
Abstract: Evaluating large language models (LLMs) across languages remains challenging, as most multilingual benchmarks rely on translated English datasets, often obscuring linguistic and cultural specificity in the target language.
This issue is particularly pronounced for less-resourced languages such as Kyrgyz, where reliable natively authored evaluation data are scarce.
Building on previously introduced Kyrgyz-language evaluation datasets, this work reports the first systematic and large-scale evaluation of LLMs in Kyrgyz using the KyrgyzLLM-Bench benchmark suite.
KyrgyzLLM-Bench comprises two natively authored datasets, KyrgyzMMLU and KyrgyzRC, together with carefully translated and manually post-edited versions of WinoGrande, HellaSwag, BoolQ, and TruthfulQA.
We evaluate 26 open- and closed-source LLMs under zero-shot and few-shot settings, analyzing model performance, cross-lingual transfer, and the impact of translation artifacts on evaluation reliability.
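A minimal sketch of the zero-/few-shot multiple-choice evaluation loop the abstract describes, assuming a log-likelihood scoring interface to the model; the item field names (question, choices, answer) and the score callable are illustrative assumptions, not the paper's actual harness.

```python
# Illustrative sketch of zero-/few-shot multiple-choice evaluation.
# Field names and the scoring interface are assumptions, not the
# benchmark's actual implementation.
from typing import Callable

def build_prompt(item: dict, shots: list[dict]) -> str:
    """Format k in-context examples followed by the test question."""
    blocks = []
    for ex in shots + [item]:
        lines = [f"Суроо: {ex['question']}"]  # "Question:" in Kyrgyz
        for label, choice in zip("ABCD", ex["choices"]):
            lines.append(f"{label}. {choice}")
        answer = f" {ex['answer']}" if ex is not item else ""
        lines.append(f"Жооп:{answer}")  # "Answer:" in Kyrgyz
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def evaluate(items: list[dict], shots: list[dict],
             score: Callable[[str, str], float]) -> float:
    """Accuracy when the model must pick the highest-scoring option letter.

    `score(prompt, continuation)` is assumed to return the model's
    log-likelihood of `continuation` given `prompt`; with shots=[],
    this reduces to the zero-shot setting.
    """
    correct = 0
    for item in items:
        prompt = build_prompt(item, shots)
        pred = max("ABCD", key=lambda letter: score(prompt, f" {letter}"))
        correct += pred == item["answer"]
    return correct / len(items)
```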
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: datasets for low resource languages, benchmarking, NLP datasets, evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: Kyrgyz
Submission Number: 9769