Abstract: This paper studies named entity recognition (NER) under distant supervision. Distant supervision uses existing resources to annotate a training corpus, instead of requiring a corpus fully annotated by domain experts, which saves time and human effort. Its drawback is inferior label quality: errors, including false positives, false negatives, and positive type errors, are unavoidable. To address these different types of noise, we propose a token-level Curriculum-based Positive-Unlabeled Learning (CuPUL) method. Using the proposed difficulty scoring function, tokens are assigned to different curricula, with easier tokens in the earlier curricula and harder tokens in the later curricula. CuPUL then trains on progressively more curricula using the Conf-MPU loss function. Our experiments on seven datasets, including a newly collected dataset in the animal science domain, show that CuPUL achieves superior performance, and extensive studies demonstrate the effectiveness of its individual components.
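As an illustration of the curriculum idea described in the abstract, the following is a minimal sketch of how tokens might be bucketed by difficulty and then exposed to training in cumulative stages. It is not the authors' implementation: the abstract does not specify the difficulty scoring function or the Conf-MPU loss, so the scores here are assumed to be given, the equal-sized rank-based bucketing is an assumption, and the names `assign_curricula` and `cumulative_stages` are hypothetical.

```python
from typing import List, Sequence


def assign_curricula(difficulty: Sequence[float], num_curricula: int) -> List[int]:
    """Map each token to a curriculum index (0 = easiest) by difficulty rank.

    Assumption: curricula are equal-sized buckets over the sorted scores;
    the paper's actual assignment rule may differ.
    """
    if num_curricula < 1:
        raise ValueError("num_curricula must be at least 1")
    n = len(difficulty)
    # Token positions ordered from easiest to hardest (stable for ties).
    order = sorted(range(n), key=lambda i: difficulty[i])
    curricula = [0] * n
    for rank, token_idx in enumerate(order):
        curricula[token_idx] = rank * num_curricula // n
    return curricula


def cumulative_stages(curricula: Sequence[int], num_curricula: int) -> List[List[int]]:
    """Stage k contains every token whose curriculum index is <= k."""
    return [
        [i for i, c in enumerate(curricula) if c <= k]
        for k in range(num_curricula)
    ]


# Hypothetical per-token difficulty scores (higher = harder).
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.5]
curricula = assign_curricula(scores, num_curricula=3)
stages = cumulative_stages(curricula, num_curricula=3)
```

In this sketch, each stage's token set would be passed to a training step (with whatever loss is in use), so that the model sees only the easiest tokens first and the full token set in the final stage.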
Paper Type: long
Research Area: Information Extraction
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English