Enhancing Model Performance through Translation-based Data Augmentation in the context of Fake News Detection
Abstract: The rapid growth of social media in recent years has encouraged the sharing of vast amounts of data, but also the spread of fake news. This has pushed the scientific community, particularly researchers in natural language processing, to develop detection tools to combat it. However, most studies have focused on high-resource languages, for which large corpora are available.
The purpose of this paper is to shed light on low-resource languages, in particular the Algerian dialect, through an experimental study with two objectives. The first is to verify whether automatic translation from Modern Standard Arabic (MSA) to the Algerian dialect can serve as a way to increase the resources available in the Algerian dialect, especially given the rise of large language models (LLMs). The second is to assess the impact of translation-based data augmentation on fake news detection, using transformer-based Arabic pre-trained models under different data augmentation configurations. We find that LLMs are capable of generating translations that closely resemble human translations. We also show that data augmentation can cause model performance to saturate and decline, due to the noise and variations in writing style that it introduces.
Paper Type: long
Research Area: Information Retrieval and Text Mining
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Arabic Language, Arabic Dialect Language
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.