Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement?

Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed, Matthew Lease

2018 (modified: 11 Nov 2022)SIGIR 2018Readers: Everyone

Abstract: While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide \em rationales for each relevance judgment \citemcdonnell2016relevant. In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than across topics, finding this improves quality. Overall, we show that crowd judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs.\ objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.

0 Replies