Keywords: African NLP, Language Identification, African Languages
Abstract: We present AfroLIDv2.0, a multi-task neural language identification toolkit for 517 African languages and varieties. The languages covered by AfroLIDv2.0 belong to 14 language families and are spoken across 50 African countries. To make AfroLIDv2.0 robust, we train it on a multi-domain, multi-script dataset. Unlike the previous version of the tool (AfroLID), AfroLIDv2.0 is trained with a multi-task learning objective that exploits language family information: it performs language identification as the main task and language family identification as an auxiliary task. We show that this multi-task setup outperforms all previous work, with AfroLIDv2.0 reaching an F_1 score of 96.44 on our blind test set. Language identification is a core technology in NLP, and we hope that AfroLIDv2.0 will be a valuable contribution to multilingual NLP in general and African NLP in particular.
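
To make the multi-task objective in the abstract concrete, the following is a minimal sketch of how a main-task loss (language identification) and an auxiliary-task loss (language family identification) might be combined for a single example. It is an illustration only, not the AfroLIDv2.0 implementation: the function names, the additive combination, and the `aux_weight` hyperparameter are all assumptions, since the abstract does not specify how the two losses are weighted.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one example, computed from unnormalized scores.

    Uses the log-sum-exp form (subtracting the maximum) for numerical stability.
    """
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def multitask_loss(language_logits, family_logits, language_id, family_id,
                   aux_weight=0.5):
    """Main-task loss plus a weighted auxiliary-task loss.

    aux_weight is a hypothetical hyperparameter; the weighting actually used
    by AfroLIDv2.0 is not stated in the abstract.
    """
    main = cross_entropy(language_logits, language_id)      # language identification
    auxiliary = cross_entropy(family_logits, family_id)     # language family identification
    return main + aux_weight * auxiliary
```

In a setup like this, both heads would typically share one encoder, so the gradient from the family loss pushes the shared representation to group related languages together. Whether AfroLIDv2.0 follows exactly this design is not stated here.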