Abstract: Wikipedia is the largest and most widely known online encyclopedia, but its collaborative nature leads to significant disparities in article quality. In this work, we explore real-time, automatic quality assessment of Wikipedia articles through machine learning. We first constructed a dataset of 36,000 English articles described by 145 features, then compared the performance of multiple classification and regression algorithms and studied how the number of classes and features affects model performance. In the six-class setting, our classifier reached 64% accuracy and our regression methods a mean absolute error of 0.09, matching or surpassing most state-of-the-art approaches. Our model produces similar results on some non-English Wikipedias, although the error is slightly higher on other language versions. We also found that features measuring an article’s content and revision history provide the largest performance gains.
External IDs: dblp:conf/ercimdl/MoasL25