- Keywords: model of annotation, coreference resolution, anaphoric annotation, mention pair model, bayesian nonparametrics
- Abstract: The availability of large datasets is essential for progress in coreference and other areas of NLP. Crowdsourcing has proven a viable alternative to expert annotation, offering similar quality for better scalability. However, crowdsourcing require adjudication, and most models of annotation focus on classification tasks where the set of classes is predetermined. This restriction does not apply to anaphoric annotation, where coders relate markables to coreference chains whose number cannot be predefined. This gap was recently covered with the introduction of a mention pair model of anaphoric annotation (MPA). In this work we extend MPA to alleviate the effects of sparsity inherent in some crowdsourcing environments. Specifically, we use a nonparametric partially pooled structure (based on a stick breaking process), fitting jointly with the ability of the annotators hierarchical community profiles. The individual estimates can thus be improved using information about the community when the data is scarce. We show, using a recently published large-scale crowdsourced anaphora dataset, that the proposed model performs better than its unpooled counterpart in conditions of sparsity, and on par when enough observations are available. The model is thus more resilient to different crowdsourcing setups, and, further provides insights into the community of workers. The model is also flexible enough to be used in standard annotation tasks for classification where it registers on par performance with the state of the art.
- Code: https://www.dropbox.com/s/nrty7eanmgs2ecd/CommunityMPA.zip?dl=0