NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages

Published: 03 Mar 2024, Last Modified: 11 Apr 2024 · AfricaNLP 2024 · CC BY 4.0
Keywords: language modelling, low-resource, benchmarking
TL;DR: This paper presents the first evaluation suite for the Nguni languages (four South African languages) and adapts pretrained language models to them through continued pretraining.
Abstract: The Nguni languages have over 20 million home-language speakers in South Africa. Although datasets for the Nguni languages have grown considerably, no analysis of NLP model performance has been reported across all of these languages and tasks. In this paper we study pretrained language models (PLMs) for the four Nguni languages: isiXhosa, isiZulu, isiNdebele, and Siswati. We compile all publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of PLMs. Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited, language-group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower-resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal for accessing the collection of datasets and publicly release our models.
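As a rough illustration of the continued-pretraining (multilingual adaptive finetuning) approach the abstract describes, the sketch below adapts XLM-R on Nguni-language text with a standard masked-language-modelling objective using Hugging Face Transformers. The corpus file name, hyperparameters, and training setup are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of continued pretraining of XLM-R on Nguni text.
# "nguni_corpus.txt" is a hypothetical plain-text corpus (one document
# per line); the paper's real data and hyperparameters are not shown here.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the raw Nguni corpus and tokenize it.
dataset = load_dataset("text", data_files={"train": "nguni_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modelling objective with 15% token masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nguni-xlmr",
    per_device_train_batch_size=8,
    num_train_epochs=1,  # illustrative; real adaptation runs far longer
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=collator,
).train()
```

The same recipe would apply to a byte-level model such as ByT5, swapping in its tokenizer and a sequence-to-sequence denoising objective in place of masked language modelling.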
Submission Number: 10