Abstract: Natural Language Processing (NLP) for African languages such as Yorùbá remains underdeveloped due to limited annotated resources, linguistic variability, and a lack of specialized models. In this paper, we present OWE-YOR, a Yorùbá proverb dataset for text classification. Because Yorùbá proverbs are a vital part of Yorùbá cultural heritage and of everyday interaction, there is a pressing need for NLP models that are sensitive to and inclusive of linguistic diversity. Our work leverages a balanced dataset of 15,925 labeled entries, comprising 7,963 proverbs and 7,962 non-proverbs, carefully collected and annotated. We propose a methodology that combines classical machine learning and transformer-based approaches: training a Naive Bayes classifier, then fine-tuning existing language models such as multilingual BERT (cased) and AfroLM to learn the contextual features specific to Yorùbá proverbs.
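As a rough illustration of the classical baseline mentioned in the abstract, a minimal multinomial Naive Bayes classifier for proverb vs. non-proverb text could be sketched as follows. This is an assumption-laden sketch, not the paper's implementation: the whitespace tokenizer, Laplace smoothing, and the toy training sentences are all illustrative choices and do not come from the OWE-YOR dataset.

```python
# Minimal multinomial Naive Bayes sketch for proverb vs. non-proverb
# classification. Whitespace tokenization and Laplace (add-one) smoothing
# are assumptions for illustration; the example sentences are toy data,
# not entries from the OWE-YOR dataset.
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (text, label) pairs.

    Returns (log class priors, per-class word log-probabilities).
    """
    class_words = defaultdict(list)
    for text, label in docs:
        class_words[label].extend(text.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    priors = {c: math.log(sum(1 for _, l in docs if l == c) / len(docs))
              for c in class_words}
    logprobs = {}
    for c, words in class_words.items():
        counts = Counter(words)
        total = len(words) + len(vocab)  # Laplace smoothing denominator
        logprobs[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return priors, logprobs


def predict(text, priors, logprobs):
    """Return the label with the highest posterior log-score."""
    scores = {}
    for c in priors:
        scores[c] = priors[c] + sum(
            logprobs[c].get(w, 0.0)  # ignore out-of-vocabulary words
            for w in text.lower().split())
    return max(scores, key=scores.get)


# Toy usage: two hypothetical labeled examples per class.
docs = [
    ("eni to gbon", "proverb"),
    ("agba ki i wa", "proverb"),
    ("mo fe jeun", "non-proverb"),
    ("bawo ni o se wa", "non-proverb"),
]
priors, logprobs = train_nb(docs)
print(predict("agba ki i", priors, logprobs))   # → proverb
print(predict("mo fe jeun", priors, logprobs))  # → non-proverb
```

The transformer baselines in the paper (fine-tuned multilingual BERT and AfroLM) would replace this bag-of-words model with contextual embeddings, but the binary classification setup stays the same.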