Goud.ma: a News Article Dataset for Summarization in Moroccan Darija

Published: 08 Apr 2022, Last Modified: 05 May 2023
Venue: AfricaNLP 2022
Readers: Everyone
Keywords: Moroccan Darija, Summarization
TL;DR: We introduce Goud.ma: a dataset of over 158k news articles for abstractive summarization in Moroccan Darija.
Abstract: Moroccan Darija is a vernacular spoken by over 30 million people, primarily in Morocco. Despite its large number of speakers, it remains a low-resource language. In this paper, we introduce Goud.ma: a dataset of over 158k news articles for automatic summarization in code-switched Moroccan Darija. We analyze the dataset and find that it requires a high level of abstractive reasoning. We fine-tune the Arabic-language BERT (AraBERT) and the language models for the Moroccan (DarijaBERT) and Algerian (DziriBERT) national vernaculars for summarization on Goud.ma. The results show that Goud.ma is a challenging summarization benchmark. We release our dataset publicly to encourage a greater diversity of evaluation tasks for improving language modeling in Moroccan Darija.