Whose History We Keep: Benchmarking Wikidata's Record of the Past's Protagonists

Published: 05 Feb 2025, Last Modified: 05 Feb 2025, WD&R Paper, CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Authors Biographies: The first four authors are Master's students of Mathematics and Statistics at Cambridge, ETH, and the University of Göttingen, all having earned their Bachelor's degree in Mathematics at Göttingen. Stefan Haas is a professor of Historical Science at the University of Göttingen.
Keywords: Digital History, Methodology of History, Open Data, Wikidata
TL;DR: A historical case study investigating the biographical, temporal and spatial distribution of the people in Wikidata, in relation to their readers.
Abstract: Historical science compounds millennia of records of human activity, which have become available in digital, searchable form. Focusing on the largest such dataset, Wikidata, we attempt to find quantitative answers to the questions: Which periods' and which places' people do we know most about? What interests us about them? Who are we forgetting? What trends underlie the number of people registered and written about over time? We examine the individuals who have ended up in these datasets in relation to their readership, thereby shedding light not only on history itself but also on the practice of writing it. We begin with our finding that not only the number of people registered in Wikidata, but surprisingly also the fraction of the world population that is registered, follows an exponential increase over time. This is further amplified by accelerations in certain critical periods (such as around 600 BCE, 100 CE, 1500 CE, and 1740 CE), resulting in superexponential growth. In contrast, we analyze the precision of recorded birth dates and find a linear increase in the availability of birth dates precise to the decade and year. Here, we also observe a sudden increase in precision around 1500 CE, possibly due to the introduction of the printing press. Curiously, we find a statistically significant overrepresentation of certain birth months but no such effect for weekdays. The spatio-temporal analysis, based on an annotation by Laouenan et al., poignantly shows the shift of cultural centers over time, from the Middle East and China to Central Europe and later to North America. We point out that large sections of human history are staggeringly absent from the dataset; for instance, the Mughal Empire and premodern China after the Three Kingdoms Era have few representatives, despite constituting a significant portion of the world population of their time. We then turn our attention to the relationship between Wiki editors, readers, and the people they describe.
First, we quantify a highly significant association between the language of the associated Wikipedia entries and the country of origin of the person written about, which extends somewhat to geographically proximal countries as well. While the number of Wikidata entries per person over time follows a monotonic, quasi-exponential growth, the article reads per person over time are not monotonic at all, with, for instance, people from 500 BCE receiving more reads than those from 1400 CE, despite likely being fewer in number. A clear exponential growth in readership only starts around 1750 CE. Further, women and non-binary individuals, conditional on having an article about them, receive consistently more reads than men. Finally, we reflect on future developments of these metrics. Projecting our data forward, we expect that more than 1 in 1,000 people born today will receive a Wikidata entry (for comparison, more than 1 in 5,000 people have a Wikidata entry today). We hope these representatives will reflect what we care about, and we look forward to an era where societal-scale historical science can study everyone who wants to be studied.
Format: Paper
Submission Number: 3