PolitiKweli: A Swahili-English Code-switched Twitter Political Misinformation Classification Dataset
Keywords: misinformation detection, swahili, low-resource language, dataset curation
TL;DR: PolitiKweli is a Swahili-English political misinformation detection dataset curated from political Twitter posts about the Kenyan election.
Abstract: In the age of freedom of speech, users of the social media platform Twitter post millions of messages per day. These messages are not always fact-checked, which allows misinformation, that is, false or misleading news, to spread. Misinformation classification involves labelling text as either false or factual by comparing it against fact-checked news. On political matters, online misinformation can result in mistrust of political figures, polarization of communities, and offline violence. Existing studies mostly address misinformation detection for messages written in a single language such as English. Among most bilingual or multilingual user groups in countries like Kenya, Swahili-English code-switching and code-mixing are common in informal text-based communication, such as messaging on social media platforms like Twitter. There is therefore a need for more research on low-resource languages such as Swahili. This study introduces PolitiKweli, a novel Swahili-English misinformation classification dataset that contains 6,345 Swahili-English texts, 22,957 English texts, and 211 Swahili texts. The texts are labelled as fake, fact, or neutral by comparison against a fact-checked dataset that was also created for this study. The dataset curation process, including data collection, processing, and annotation, is explained, and the challenges encountered during annotation are discussed. The results of experiments conducted using a pretrained language model demonstrate the dataset's usefulness for training Swahili-English code-switched misinformation classification models.
Submission Category: Machine learning algorithms
Submission Number: 14