Keywords: crowdsourcing, Dawid-Skene model, item response theory, label aggregation
TL;DR: We explore the extent to which common assumptions about the way that crowd workers make mistakes in microtask (labeling) applications manifest in real crowdsourcing data.
Abstract: Do common assumptions about the way that crowd workers make mistakes in microtask (labeling) applications manifest in real crowdsourcing data? Prior work addresses this question only indirectly. Instead, it primarily focuses on designing new label aggregation algorithms, seeming to imply that better performance justifies any additional assumptions. However, in several past instances, empirical evidence has raised significant challenges to such assumptions. We continue this line of work, using crowdsourcing data itself as directly as possible to interrogate several basic assumptions about workers and tasks. We find strong evidence that the assumption, common in theoretical work, that each worker responds correctly to every task with a constant probability is implausible in real data. We also illustrate how heterogeneity among tasks and workers can take different forms, which have different implications for the design and evaluation of label aggregation algorithms.
Supplementary Material: pdf
Other Supplementary Material: zip