k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ones. However, many NLP datasets that rely on aggregate ratings only report the reliability of individual ones, which is the incorrect unit of analysis. In these instances, the data reliability is being under-reported. We present empirical, analytical, and bootstrap-based methods for measuring the reliability of aggregate ratings. We call this k-rater reliability (kRR), a multi-rater extension of inter-rater reliability (IRR). We apply these methods to the widely used word similarity benchmark dataset, WordSim. We conducted two replications of the WordSim dataset to obtain an empirical reference point. We hope this discussion will nudge researchers to report kRR, the correct unit of reliability for aggregate ratings, in addition to IRR.
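The abstract names analytical and bootstrap-based estimates of aggregate-rating reliability without detailing them. The sketch below illustrates two common approaches consistent with that description, though not necessarily the authors' exact procedures: the Spearman-Brown prophecy formula, a standard psychometric extrapolation from single-rater reliability to the reliability of a k-rater mean, and a split-half bootstrap that repeatedly correlates the mean ratings of two disjoint groups of k raters. Function names and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def spearman_brown(irr: float, k: int) -> float:
    """Analytical kRR estimate: predicted reliability of the mean of k raters,
    extrapolated from single-rater reliability `irr` (Spearman-Brown prophecy).
    This is one standard analytical route; the paper's exact method may differ."""
    return k * irr / (1 + (k - 1) * irr)


def bootstrap_krr(ratings: np.ndarray, k: int, n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap kRR estimate (split-half style, an illustrative assumption).

    ratings: array of shape (n_items, n_raters); requires n_raters >= 2 * k.
    Each replicate draws two disjoint groups of k raters, averages each group's
    ratings per item, and correlates the two aggregate vectors. Returns the mean
    Spearman correlation across replicates.
    """
    rng = np.random.default_rng(seed)
    n_items, n_raters = ratings.shape
    assert n_raters >= 2 * k, "need at least 2*k raters to form disjoint groups"
    corrs = []
    for _ in range(n_boot):
        perm = rng.permutation(n_raters)
        group_a = ratings[:, perm[:k]].mean(axis=1)
        group_b = ratings[:, perm[k:2 * k]].mean(axis=1)
        corrs.append(spearmanr(group_a, group_b).correlation)
    return float(np.mean(corrs))


if __name__ == "__main__":
    # Toy example: 50 items rated by 12 raters on a 0-10 scale (synthetic data).
    rng = np.random.default_rng(42)
    true_scores = rng.uniform(0, 10, size=(50, 1))
    noisy_ratings = true_scores + rng.normal(0, 2.5, size=(50, 12))
    print("bootstrap kRR (k=3):", round(bootstrap_krr(noisy_ratings, k=3), 3))
    print("Spearman-Brown (r=0.5, k=3):", round(spearman_brown(0.5, 3), 3))
```

As expected under either estimate, reliability grows with k: averaging more raters per item yields aggregate scores that agree more closely across independent rater groups than any single rater's scores do.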
