MasakhaNEWS: News Topic Classification for African languages

David Ifeoluwa Adelani; Marek Masiak; Israel Abebe Azime; Jesujoba Oluwadara Alabi; Atnafu Lambebo Tonja; Christine Mwase; Odunayo Ogundepo; Bonaventure F. P. Dossou; Akintunde Oladipo; Doreen Nixdorf; Chris Chinenye Emezue; Sana Sabah Al-Azzawi; Blessing Kudzaishe Sibanda; Davis David; Lolwethu Ndolela; Jonathan Mukiibi; Tunde Oluwaseyi Ajayi; Tatiana Moteu Ngoli; Brian Odhiambo; Abraham Toluwase Owodunni; Nnaemeka Casmir Obiefuna; Shamsuddeen Hassan Muhammad; Saheed Salahudeen Abdullahi; Mesay Gemeda Yigezu; Tajuddeen Gwadabe; Idris Abdulmumin; Mahlet Taye Bame; Oluwabusayo Olufunke Awoyomi; Iyanuoluwa Shode; Tolulope Anu Adelani; Habiba Abdulganiy Kailani; Abdul-Hakeem Omotayo; Adetola Adeeko; Afolabi Abeeb; Anuoluwapo Aremu; Olanrewaju Samuel; Clemencia Siro; Wangari Kimotho; Onyekachi Ogbu; Chinedu Emmanuel Mbonu; Chiamaka Ijeoma Chukwuneke; Samuel Fanijo; Jessica Ojo; Oyinkansola Fiyinfoluwa Awosan; Tadesse Kebede Guge; Toadoum Sari Sakayo; Pamela Nyatsine; Freedmore Sidume; Oreen Yousuf; Mardiyyah Oduwole; Ussen Abre Kimanuka; Kanda Patrick Tshinu; Thina Diko; Siyanda Nxakama; Abdulmejid Tuni Johar; Sinodos Gebre; Muhidin A. Mohamed; Shafie Abdi Mohamed; Fuad Mire Hassan; Moges Ahmed Mehamed; Evrard Ngabire; Pontus Stenetorp

Published: 03 Mar 2023 | Last Modified: 27 Apr 2025 | AfricaNLP 2023 | Readers: Everyone

Keywords: news topic classification, BERT, language models, prompt-tuning

TL;DR: A new dataset for news topic classification in 16 languages widely spoken in Africa, with baseline and few-shot learning experiments.

Abstract: African languages are severely under-represented in NLP research because few datasets cover multiple NLP tasks. While individual language-specific datasets are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We evaluate baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without the additional supervision that approaches like MAD-X rely on. In the few-shot setting, we show that with as few as 10 examples per label, the PET approach reaches 86.0 F1 points, more than 90% of the performance of full supervised training (92.6 F1 points).
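
The "more than 90%" figure in the few-shot result can be checked directly from the two F1 scores quoted in the abstract. The snippet below is only a worked calculation; the function name `relative_performance` and the constant names are illustrative and do not come from the paper or its code.

```python
# F1 scores quoted in the abstract (in F1 points).
FEW_SHOT_PET_F1 = 86.0      # PET with 10 examples per label
FULL_SUPERVISED_F1 = 92.6   # full supervised training


def relative_performance(few_shot: float, full: float) -> float:
    """Return the few-shot score as a fraction of the fully supervised score."""
    if full <= 0:
        raise ValueError("the fully supervised score must be positive")
    return few_shot / full


ratio = relative_performance(FEW_SHOT_PET_F1, FULL_SUPERVISED_F1)
print(f"{ratio:.1%}")  # prints 92.9%
```

The ratio comes out at roughly 92.9%, which is consistent with the abstract's claim of more than 90% of fully supervised performance.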
