CoLI@FIRE2023: Findings of Word-level Language Identification in Code-mixed Tulu Text

Published: 01 Jan 2023, Last Modified: 19 May 2025, Venue: FIRE 2023, License: CC BY-SA 4.0
Abstract: The word-level Language Identification (LI) task determines the language of each word in a code-mixed sentence, that is, a sentence whose words belong to more than one language, mixed at the word or sub-word level. This task has been studied extensively in code-mixed text involving high-resource languages such as Spanish, French, and German, but it has been explored far less for some under-resourced languages and not at all for others. In view of this, the shared task "CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Texts" at the Forum for Information Retrieval Evaluation (FIRE) 2023 invited researchers to develop learning models for word-level LI in code-mixed Tulu text. The CoLI-Tunglish dataset mixes three languages (Tulu, Kannada, and English) at the word/sub-word level, and the objective is to assign each word in a given sentence one of seven predefined labels: Tulu, Kannada, English, Mixed (a combination of Tulu, Kannada, and/or English), Name, Location, and Other. This paper gives an overview of the methodologies and results of the participating teams: of the 14 registered teams, five submitted a total of 10 runs. Among all submitted models, the top-performing one obtained a macro F1 score of 0.81. The results achieved by the participating teams indicate a promising direction for tackling word-level LI in code-mixed Tulu text. They offer valuable insights and potential solutions, opening new avenues of research on language technologies for code-mixed Tulu text.
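For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1 score over the seven labels could be computed. It is not the organizers' evaluation script. The function name `macro_f1`, the `LABELS` constant, and the toy label sequences are hypothetical, and the sketch assumes that a label with no gold or predicted occurrences contributes an F1 of 0.

```python
from collections import Counter

# The seven labels named in the abstract (the constant name is illustrative).
LABELS = ["Tulu", "Kannada", "English", "Mixed", "Name", "Location", "Other"]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 over word-level labels.

    Assumption: a label that never appears in gold or pred scores 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy label sequences for one four-word sentence (not taken from the dataset).
gold = ["Tulu", "English", "Kannada", "Tulu"]
pred = ["Tulu", "English", "Tulu", "Tulu"]
score = macro_f1(gold, pred, labels=["Tulu", "Kannada", "English"])
```

Because every label carries equal weight in the average, an error on a rare label such as Mixed or Location lowers the score as much as an error on a frequent one, which is why macro averaging is a common choice for imbalanced label sets.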