CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

ACL ARR 2026 January Submission3892 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Language Identification, Evaluation, Dataset, Multilinguality, Data Curation, Web Corpus

Abstract: Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, benchmarking, language resources, multilingual corpora, NLP datasets, evaluation, datasets for low resource languages

Contribution Types: Approaches to low-resource settings, Data resources, Data analysis

Languages Studied: Acehnese, Saint Lucian Creole French, Tunisian Arabic, Afrikaans, Amharic, Sudanese Arabic, Arabic, Standard Arabic, Aragonese, Najdi Arabic, Moroccan Arabic, Egyptian Arabic, Assamese, Azerbaijani, North Azerbaijani, Bashkir, Bikol Central, Bengali, Bikol, Breton, Bulgarian, Catalan, Czech, Mandarin Chinese, Crimean Tatar, German, Modern Greek, English, Estonian, Extremaduran, Persian, Filipino, Finnish, French, Old French, Western Frisian, Nigerian Fulfulde, West Central Oromo, Guadeloupean Creole French, French Guianese Creole, Scottish Gaelic, Irish, Goan Konkani, Ancient Greek, Paraguayan Guaraní, Gujarati, Gwari, Hausa, Biblical Hebrew, Hebrew, Hindi, Igbo, Indonesian, Italian, Javanese, Japanese, Kabyle, Kannada, Kikuyu, Korean, Latin, Latvian, Ligurian, Lingala, Latgalian, Ganda, Standard Latvian, Malayalam, Marathi, Malagasy, Malay, Dutch, Northern Sotho, Nyankole, Occitan, Oromo, Odia, Panjabi, Nigerian Pidgin, Polish, Portuguese, Réunion Creole French, Russian, Sanskrit, Shona, Southern Sotho, Spanish, Swahili, Congo Swahili, Tamil, Tatar, Telugu, Tagalog, Thai, Turkmen, Turkish, Ukrainian, Urdu, Uzbek, Southern Uzbek, Venetian, Vietnamese, Wu Chinese, Xhosa, Yoruba, Yue Chinese (Cantonese), Chinese, Standard Malay, Zulu

Submission Number: 3892
