Behind the Edits: Exploring Human-Bot Collaboration in Wikipedia

08 Dec 2024 (modified: 05 Feb 2025) · Submitted to WD&R · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: Wikipedia, automation-assisted corpora generation, tags, metadata, LLMs, NLP, AI, edit tools
Abstract: Content in Wikipedia is added by both human volunteers and automated bots. Along the spectrum from purely human edits to purely automated edits, humans also use an array of automated tools that could be classified as light automation (e.g. spell checking) to heavier automation (e.g. content translation from one language to another) to generative AI tools with light human review. Unlike many other corpora of data, Wikipedia keeps detailed metadata about the source of edits (human or bot) and the tools used in the editing process. This offers a unique opportunity to explore how automation impacts content quality, breadth, and update frequency. Moreover, such categorization empowers researchers to make informed, data-driven decisions about the appropriate balance between bot and human involvement when using the resulting data for various purposes. Besides the high level classification of human or bot, Wikipedia tracks a wide variety of special tags that also shed light on the range of automation even for edits tagged as human edits. In this study, we categorize the special tags added in Wikipedia according to which tags indicate information about the level of automation-created content versus human-created content and which do not. We started off with a list of over 300 tags, each representing a unique type of edit or action logged in the metadata of a Wikipedia page. These tags ranged from general metadata indicators to more specific labels highlighting the use of tools, automated processes, or manual interventions. We first identify over 50 tags that we classify as relevant for placing edits to Wikipedia content on a spectrum from mostly human to mostly automated. Our categorization process was guided by several criteria, including whether a tag explicitly indicated the use of automation (e.g., tags associated with bots like "IABot" or “AWB") or manual edits requiring human oversight (e.g., "Manual revert" or "DiscussionTools"). 
Second, we classified tags on a scale of 1-5 based on the degree of automation implied by the tag. Special tags in Wikipedia are added voluntarily, so tracking is imperfect, but they still represent some of the best available records of the process by which content is created. Unlike many corpora, which do not invest at all in such detailed tracking of the creation process, Wikipedia's metadata allows researchers to include or exclude certain types of content based on how it was created, depending on the goals of the downstream task. For some tasks, automated or bot-created content may be perfectly useful, whereas for other tasks, including bot-generated content could degrade the quality of any models or articles produced. Our research aims to establish a clearer understanding of how content in Wikipedia is generated and then to assess its suitability for a variety of use cases.
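As a minimal sketch of how such a 1-5 classification could be applied downstream, the snippet below scores revisions by their most automated tag and filters a corpus at a chosen threshold. The tag-to-score mapping here is illustrative only; the actual scores are those assigned in the study.

```python
# Hypothetical mapping of Wikipedia edit tags to an automation score
# (1 = mostly human, 5 = fully automated). Illustrative values only.
TAG_AUTOMATION = {
    "Manual revert": 1,
    "DiscussionTools": 1,
    "ContentTranslation": 3,
    "AWB": 4,
    "IABot": 5,
}

def automation_score(tags):
    """Score a revision by the most automated tag it carries (default 1)."""
    return max((TAG_AUTOMATION.get(t, 1) for t in tags), default=1)

def filter_revisions(revisions, max_score):
    """Keep revisions at or below the chosen automation threshold."""
    return [r for r in revisions if automation_score(r["tags"]) <= max_score]

revisions = [
    {"id": 1, "tags": ["Manual revert"]},
    {"id": 2, "tags": ["IABot"]},
    {"id": 3, "tags": ["ContentTranslation", "DiscussionTools"]},
]
kept = filter_revisions(revisions, max_score=3)
print([r["id"] for r in kept])  # → [1, 3]: the fully automated IABot edit is dropped
```

Taking the maximum over a revision's tags is a deliberately conservative choice: a single high-automation tag is enough to mark the whole edit as automated.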
Format: Paper (20 minutes presentation)
Submission Number: 34