Abstract: Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ratings. However, many NLP datasets that rely on aggregate ratings report only the reliability of individual ratings, which is the incorrect unit of analysis. In these cases, the reliability of the data is under-reported. We present empirical, analytical, and bootstrap-based methods for measuring the reliability of aggregate ratings, which we call k-rater reliability (kRR), a multi-rater extension of inter-rater reliability (IRR). We apply these methods to the widely used word similarity benchmark dataset, WordSim, and conducted two replications of the WordSim dataset to obtain an empirical reference point. We hope this discussion will nudge researchers to report kRR, the correct unit of reliability for aggregate ratings, in addition to IRR.
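The bootstrap idea behind kRR can be sketched in a few lines: repeatedly draw two disjoint random groups of k raters, average each group's ratings per item, and correlate the two aggregates. The toy example below is only an illustration of this split-half bootstrap logic, not the paper's implementation; the simulated data, the group-splitting scheme, and the function name `bootstrap_krr` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix: 50 items rated by 20 raters (e.g. word-pair
# similarity scores on a 0-10 scale, loosely WordSim-style).
n_items, n_raters = 50, 20
true_scores = rng.uniform(0, 10, n_items)
ratings = true_scores[:, None] + rng.normal(0, 2.0, (n_items, n_raters))

def bootstrap_krr(ratings, k, n_boot=1000, rng=rng):
    """Bootstrap estimate of k-rater reliability: the correlation between
    mean ratings of two disjoint random groups of k raters, averaged over
    resamples. Requires 2*k <= number of raters."""
    n_raters = ratings.shape[1]
    corrs = []
    for _ in range(n_boot):
        perm = rng.permutation(n_raters)
        g1 = ratings[:, perm[:k]].mean(axis=1)       # aggregate of k raters
        g2 = ratings[:, perm[k:2 * k]].mean(axis=1)  # independent aggregate
        corrs.append(np.corrcoef(g1, g2)[0, 1])
    return float(np.mean(corrs))

irr = bootstrap_krr(ratings, k=1)   # single-rater (IRR-like) reliability
krr = bootstrap_krr(ratings, k=10)  # reliability of 10-rater averages
print(irr, krr)
```

Under this simulation, kRR for k=10 comes out well above the single-rater value, illustrating the abstract's point that aggregate ratings are more reliable than individual ones and that reporting only IRR under-reports the dataset's reliability.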