Unveiling Swahili Verb Conjugations: A Comprehensive Dataset for Low-Resource NLP

Irene Masiringi Mathayo, Alfred Malengo Kondoro

Published: 2024, Last Modified: 07 Oct 2025. Venue: NLPIR 2024. License: CC BY-SA 4.0

Abstract: This paper presents a comprehensive dataset of Swahili verb conjugations, designed to overcome the linguistic challenges posed by Swahili's agglutinative morphology. This morphological complexity has hindered Natural Language Processing (NLP) models from effectively processing Swahili, currently a low-resource language. The dataset encompasses over 319,156 verb forms, spanning five tenses, three grammatical persons, and both singular and plural forms. This resource supports key tasks such as tokenization, lemmatization, and morphological analysis. By capturing the intricate verb structures of Swahili, the dataset aims to enhance model performance and enable the development of more accurate NLP tools for Swahili. Additionally, it offers broader implications for processing other agglutinative languages within the Bantu family.
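To make the agglutinative structure concrete, the following is a minimal sketch of how an affirmative Swahili verb form can be assembled from a subject prefix, a tense marker, a root, and a final vowel. It is not the authors' generation procedure or the dataset's schema. The function names, the record layout, and the choice of four common tense markers are assumptions made for illustration; the abstract does not list which five tenses the dataset covers. The sketch also ignores negation, object markers, monosyllabic roots, and verbs that do not end in -a.

```python
from itertools import product

# Subject prefixes for the three persons, singular and plural.
# Third person uses the noun class 1/2 (animate) prefixes a- and wa-.
SUBJECT_PREFIXES = {
    (1, "sg"): "ni", (2, "sg"): "u", (3, "sg"): "a",
    (1, "pl"): "tu", (2, "pl"): "m", (3, "pl"): "wa",
}

# Four common affirmative tense/aspect markers (illustrative selection,
# not necessarily the five tenses used in the dataset).
TENSE_MARKERS = {
    "present": "na",
    "past": "li",
    "future": "ta",
    "perfect": "me",
}


def conjugate(root, person, number, tense, final_vowel="a"):
    """Build subject prefix + tense marker + root + final vowel."""
    return SUBJECT_PREFIXES[(person, number)] + TENSE_MARKERS[tense] + root + final_vowel


def paradigm(root):
    """Return one record per (tense, person, number) combination (hypothetical layout)."""
    rows = []
    for tense, (person, number) in product(TENSE_MARKERS, SUBJECT_PREFIXES):
        rows.append({
            "lemma": "ku" + root + "a",  # infinitive, e.g. kusoma 'to read'
            "tense": tense,
            "person": person,
            "number": number,
            "form": conjugate(root, person, number, tense),
        })
    return rows


if __name__ == "__main__":
    for row in paradigm("som")[:6]:
        print(row["form"], row["tense"], row["person"], row["number"])
```

Each record pairs a surface form with its lemma and grammatical features, which is the kind of alignment that lemmatization and morphological analysis tasks rely on. With six person-number combinations per tense, the number of forms grows quickly as more verbs and tenses are added.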

External IDs: dblp:conf/nlpir/MathayoK24