Abstract: Among the news items circulating on social media, only some contain factual statements, and factual claims can be differentiated by their check-worthiness. We describe the check-worthiness annotation of a novel corpus of claims obtained from real-world submissions to a German fact-checking organization: the German Crowd Claims (GCC) corpus. We iteratively adapted existing annotation guidelines, introducing the novel category of incident/event and a third level of annotation for statements. Exploratory analysis of 35 linguistic surface-level features highlights sentence length as the strongest predictor of check-worthiness, but remains inconclusive for more specific annotation. We therefore investigated the performance of transformer-based models for check-worthiness detection on the GCC corpus. Classification accuracy increased when we translated the dataset into English, augmented it with data from a related task, and enriched the semantics by including related ontology embeddings.
External IDs: doi:10.1515/lingvan-2024-0067