Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation

Anonymous

16 Dec 2022 (modified: 05 May 2023) | ACL ARR 2022 December Blind Submission | Readers: Everyone
Abstract: Crowdsourcing is a scalable data collection method used in many NLP tasks. Because expertise varies among crowd workers, prior studies use worker selection to improve the quality of crowdsourced datasets. However, most such methods are designed for and tested on simple classification tasks. In this paper, we focus on span-based sequence labeling tasks in NLP, which are more challenging because nearby labels have complex inter-dependencies. We propose a new worker selection algorithm based on the combinatorial multi-armed bandit (CMAB). Our algorithm maximizes annotation quality while reducing overall cost by using both majority-voted and expert annotations for evaluation. A key challenge is that most existing crowdsourced datasets are highly imbalanced and small in scale, which makes offline simulation of worker selection difficult. To address this issue, we present a novel data augmentation method called shifting, expanding, and shrinking (SES), customized for sequence labeling. We augment two datasets, CoNLL 2003 NER and Chinese OEI, on which we extensively test our worker selection algorithm. The results show that our algorithm achieves up to 100.04% of the F1 score of an expert-evaluation-only baseline (i.e., all annotations evaluated by experts) while saving up to 65.97% of expert evaluation costs. We also include a dataset-independent test in which annotation evaluation is simulated through a Bernoulli distribution; there, our algorithm achieves 97.56% of the baseline F1 while saving 59.88% of expert costs.
Paper Type: long
Research Area: Efficient Methods for NLP
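
To make the two components named in the abstract concrete, here is a minimal sketch of a span perturbation in the spirit of SES. The abstract does not specify the exact procedure, so the function names, the per-side offset budget `k`, and the `(start, end, label)` span format are all assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of "shifting, expanding, and shrinking" (SES) span
# perturbations for sequence labeling. All names and offsets here are
# illustrative assumptions; the paper's actual procedure may differ.
import random

def shift(span, n_tokens, k=1):
    """Move the whole span left or right by up to k tokens."""
    start, end, label = span
    d = random.randint(-k, k)
    if 0 <= start + d <= end + d < n_tokens:
        return (start + d, end + d, label)
    return span  # out of bounds: keep the original span

def expand(span, n_tokens, k=1):
    """Widen the span by up to k tokens on each side."""
    start, end, label = span
    start = max(0, start - random.randint(0, k))
    end = min(n_tokens - 1, end + random.randint(0, k))
    return (start, end, label)

def shrink(span, k=1):
    """Narrow the span by up to k tokens on each side, keeping it non-empty."""
    start, end, label = span
    start += random.randint(0, k)
    end -= random.randint(0, k)
    if start > end:
        return span  # would become empty: keep the original span
    return (start, end, label)

def ses_augment(spans, n_tokens):
    """Apply one randomly chosen SES operation to each gold span."""
    ops = [lambda s: shift(s, n_tokens), lambda s: expand(s, n_tokens), shrink]
    return [random.choice(ops)(s) for s in spans]
```

Similarly, the dataset-independent test described in the abstract can be pictured as a combinatorial UCB loop with Bernoulli-simulated annotation quality. The sketch below is a generic CUCB-style selection rule, not the paper's algorithm; the per-round budget `m`, the exploration constant, and the `true_acc` worker accuracies are hypothetical.

```python
# Sketch of a combinatorial UCB (CUCB) worker-selection loop with Bernoulli
# rewards standing in for annotation quality, as in a dataset-independent
# simulation. Budget m, exploration constant, and accuracies are assumptions.
import math
import random

true_acc = [0.9, 0.8, 0.7, 0.6, 0.5]   # hypothetical per-worker accuracies
n_workers, m, rounds = len(true_acc), 2, 2000
counts = [0] * n_workers                # times each worker was selected
means = [0.0] * n_workers               # empirical mean reward per worker

for t in range(1, rounds + 1):
    # UCB index: empirical mean plus an exploration bonus (inf if unexplored).
    ucb = [means[i] + math.sqrt(1.5 * math.log(t) / counts[i])
           if counts[i] > 0 else float("inf") for i in range(n_workers)]
    chosen = sorted(range(n_workers), key=lambda i: ucb[i], reverse=True)[:m]
    for i in chosen:
        reward = 1.0 if random.random() < true_acc[i] else 0.0  # Bernoulli feedback
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update

print("selection counts:", counts)  # high-accuracy workers dominate over time
```

Over many rounds, the selection counts concentrate on the most accurate workers, which is the behavior a worker-selection bandit is meant to exhibit while limiting how often costly expert evaluation is needed.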