Keywords: less-resourced languages, large language models, benchmarks, Kyrgyz language, evaluation
Abstract: Evaluating large language models (LLMs) across languages remains challenging, as most multilingual benchmarks rely on translated English datasets, often obscuring linguistic and cultural specificity in the target language.
This issue is particularly pronounced for less-resourced languages such as Kyrgyz, where reliable natively authored evaluation data are scarce.
Building on previously introduced Kyrgyz-language evaluation datasets, this work reports the first systematic and large-scale evaluation of LLMs in Kyrgyz using the KyrgyzLLM-Bench benchmark suite.
KyrgyzLLM-Bench comprises two natively authored datasets, KyrgyzMMLU and KyrgyzRC, together with carefully translated and manually post-edited versions of WinoGrande, HellaSwag, BoolQ, and TruthfulQA.
We evaluate 26 open- and closed-source LLMs under zero-shot and few-shot settings, analyzing model performance, cross-lingual transfer, and the impact of translation artifacts on evaluation reliability.
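A minimal sketch of the zero-/few-shot multiple-choice evaluation loop the abstract describes, assuming a log-likelihood scoring interface to the model; the item field names (question, choices, answer) and the score callable are illustrative assumptions, not the paper's actual harness.

```python
# Illustrative sketch of zero-/few-shot multiple-choice evaluation.
# Field names and the scoring interface are assumptions, not the
# benchmark's actual implementation.
from typing import Callable

def build_prompt(item: dict, shots: list[dict]) -> str:
    """Format k in-context examples followed by the test question."""
    blocks = []
    for ex in shots + [item]:
        lines = [f"Суроо: {ex['question']}"]  # "Question:" in Kyrgyz
        for label, choice in zip("ABCD", ex["choices"]):
            lines.append(f"{label}. {choice}")
        answer = f" {ex['answer']}" if ex is not item else ""
        lines.append(f"Жооп:{answer}")  # "Answer:" in Kyrgyz
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def evaluate(items: list[dict], shots: list[dict],
             score: Callable[[str, str], float]) -> float:
    """Accuracy when the model must pick the highest-scoring option letter.

    `score(prompt, continuation)` is assumed to return the model's
    log-likelihood of `continuation` given `prompt`; with shots=[],
    this reduces to the zero-shot setting.
    """
    correct = 0
    for item in items:
        prompt = build_prompt(item, shots)
        pred = max("ABCD", key=lambda letter: score(prompt, f" {letter}"))
        correct += pred == item["answer"]
    return correct / len(items)
```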
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: datasets for low resource languages, benchmarking, NLP datasets, evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: Kyrgyz
Submission Number: 9769