ViFactCheck: Empowering Vietnamese Fact-Checking across Multiple Domains with a Comprehensive Benchmark Dataset and Methods
Abstract: With the rapid development of online information platforms, barriers to the dissemination of information, particularly in media, are diminishing. However, this has led to various issues, including the proliferation of fake news. High-quality datasets and robust solutions for fact-checking, especially for low-resource languages, are therefore essential. This study presents the \textbf{ViFactCheck} dataset, the first publicly available benchmark {\bf Vi}etnamese {\bf Fact}-{\bf Check}ing dataset covering multiple online news domains. Comprising 7,232 human-annotated statements from reputable Vietnamese online news sources, the dataset covers 12 topics and follows a strict data-construction process. We also evaluate state-of-the-art monolingual and multilingual pre-trained language models on the ViFactCheck dataset. The XLM-R$_{large}$ model outperforms strong baseline models such as mBERT, XLM-R$_{base}$, PhoBERT$_{large}$, PhoBERT$_{base}$, and ViBERT, achieving a macro F1 score of 78.40\%. These findings demonstrate the dataset's potential for practical applications.
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Vietnamese