Abstract: Wikidata is the new, large-scale knowledge base of the Wikimedia
Foundation. Its knowledge is increasingly used within Wikipedia
itself and various other kinds of information systems, imposing high
demands on its integrity. Wikidata can be edited by anyone and,
unfortunately, it frequently gets vandalized, exposing all information
systems using it to the risk of spreading vandalized and falsified
information. In this paper, we present a new machine learning-based
approach to detect vandalism in Wikidata. We propose a set of
47 features that exploit both content and context information, and
we report on 4 classifiers of increasing effectiveness tailored to this
learning task. We evaluate our approach on the recently published
Wikidata Vandalism Corpus WDVC-2015, where it achieves an area
under the receiver operating characteristic curve (ROCAUC)
of 0.991. It significantly outperforms the state of the art represented
by the rule-based Wikidata Abuse Filter (0.865 ROCAUC) and a
prototypical vandalism detector recently introduced by Wikimedia
within the Objective Revision Evaluation Service (0.859 ROCAUC).
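
For readers unfamiliar with the metric, ROCAUC can be read as the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen regular revision, with ties counted as one half. The following is a minimal sketch of that pairwise definition, not the evaluation code used in the paper; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise ranking definition (ties count as 0.5).

    labels: iterable of 0/1, where 1 marks the positive class (here: vandalism).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = vandalism, 0 = regular edit.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(f"ROCAUC = {auc:.3f}")
```

The quadratic loop is kept for clarity; on a corpus the size of WDVC-2015, a rank-based or library implementation would be the practical choice.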