Uzbek news corpus for named entity recognition

Published: 01 Jan 2025, Last Modified: 04 Nov 2025, Venue: Lang. Resour. Evaluation 2025, License: CC BY-SA 4.0
Abstract: We present a corpus of Uzbek news articles with manually annotated named entities. The corpus comprises 500 articles (222,536 tokens) sourced from Qalampir, an online news source in Uzbekistan, and is annotated with three entity classes (person, location, organization). It can be used to develop and evaluate natural language processing (NLP) models for Uzbek. We conducted a baseline experiment on the Qalampir corpus using pre-trained models. The results showed that the pre-trained model CINO outperformed the other multilingual models.