NewsPolyML: Multi-lingual European News Fake Assessment Dataset

Salar Mohtaj, Ata Nizamoglu, Premtim Sahitaj, Vera Schmitt, Charlott Jakob, Sebastian Möller

Published: 2024, Last Modified: 06 Oct 2025, MAD@ICMR 2024, CC BY-SA 4.0

Abstract: With the rapid growth of social media and online platforms, disinformation now spreads across the globe and across languages. Detecting it in non-English languages is crucial because disinformation is global in nature and transcends linguistic barriers. Addressing this issue effectively requires robust detection mechanisms tailored to different languages, and a multi-lingual dataset is needed to apply multilingual disinformation detection approaches across those languages. In this paper, we present a multi-lingual dataset of fact-checked statements in different European languages, including English, German, French, Spanish, and Italian. The multi-lingual European disinformation assessment (NewsPolyML) dataset contains over 32,000 check-worthy claims, each fact-checked by a certified member of the International Fact-Checking Network (IFCN) between April 2012 and March 2024. The data is further enriched with a novel label normalization approach that uses the Mixtral model to harmonize the diverse rating methodologies used by different sources.