Trade-offs Between Query Difficulty and Sample Complexity in Crowdsourced Data Acquisition

Hye Won Chung, Ji Oon Lee, Doyeon Kim, Alfred O. Hero III

2018 (modified: 13 Apr 2022), Allerton 2018

Abstract: Consider a crowdsourcing system whose task is to classify k objects in a database into two groups according to the binary attributes of the objects. We propose a parity response model: a worker is asked whether the number of objects having a given attribute in a chosen subset is even or odd, and either responds with the correct binary answer or declines to respond. We propose a method for designing the sequence of query subsets so that the attributes of the objects can be identified with high probability from a small number n of answers. The method is based on an analogy to the design of Fountain codes for erasure channels. We define the query difficulty d̅ as the average size of the query subsets, and the sample complexity n as the minimum number of collected answers required to attain a given recovery accuracy. We obtain fundamental trade-offs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity for recovering all k attributes with high probability is n = c_0 max{k, (k log k)/d̅}, and the sample complexity for recovering a fixed proportion (1 - δ)k of the attributes, for δ = o(1), is n = c_1 max{k, (k log(1/δ))/d̅}, for some constants c_0, c_1 > 0.
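The parity response model can be illustrated with a small simulation. The sketch below is not the authors' query design. It assumes uniformly random subsets of a fixed size d (so d̅ = d) in place of the Fountain-code-inspired design, uses plain Gaussian elimination over GF(2) as the decoder, and treats the constant c_0 as a placeholder, since the abstract does not give its value. All function names and parameters are hypothetical.

```python
import math
import random


def parity_query(attributes, subset):
    """Correct worker answer: 0 if an even number of objects in `subset`
    have the attribute, 1 if an odd number do."""
    return sum(attributes[i] for i in subset) % 2


def collect_answers(attributes, subsets, decline_prob, rng):
    """Simulate workers who either answer correctly or decline (an erasure)."""
    answers = []
    for subset in subsets:
        if rng.random() >= decline_prob:
            answers.append((frozenset(subset), parity_query(attributes, subset)))
    return answers


def solve_gf2(k, answers):
    """Gaussian elimination over GF(2) on the collected parity equations.
    Returns a list of length k: the recovered bit, or None where the answers
    do not determine that attribute uniquely."""
    pivots = {}  # pivot column -> (row bitmask, parity bit), kept fully reduced
    for subset, parity in answers:
        mask = sum(1 << i for i in subset)
        # Clear every existing pivot column from the new row.
        for col, (pmask, pbit) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                parity ^= pbit
        if mask == 0:
            continue  # redundant equation, carries no new information
        new_col = mask.bit_length() - 1
        # Clear the new pivot column from the rows stored so far.
        for col, (pmask, pbit) in list(pivots.items()):
            if (pmask >> new_col) & 1:
                pivots[col] = (pmask ^ mask, pbit ^ parity)
        pivots[new_col] = (mask, parity)
    recovered = [None] * k
    for col, (mask, bit) in pivots.items():
        if mask == 1 << col:  # row involves a single attribute: determined
            recovered[col] = bit
    return recovered


def sample_complexity_order(k, d_bar, c0=1.0):
    """Evaluate c0 * max{k, (k log k) / d_bar}. c0 is an assumed placeholder."""
    return c0 * max(k, k * math.log(k) / d_bar)


if __name__ == "__main__":
    rng = random.Random(0)
    k, d = 50, 4  # assumed problem size and fixed query size
    attributes = [rng.randint(0, 1) for _ in range(k)]
    n_queries = 4 * math.ceil(sample_complexity_order(k, d))
    subsets = [rng.sample(range(k), d) for _ in range(n_queries)]
    answers = collect_answers(attributes, subsets, decline_prob=0.3, rng=rng)
    recovered = solve_gf2(k, answers)
    n_ok = sum(r == a for r, a in zip(recovered, attributes))
    print(f"{len(answers)} answers collected, {n_ok}/{k} attributes recovered")
```

Because declined queries are simply dropped, the decoder only ever sees correct parities, which is what makes the erasure-channel analogy apply. Increasing d lowers the bound returned by sample_complexity_order until it reaches k, which mirrors the trade-off between query difficulty and sample complexity stated in the abstract.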
