Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach

ACL ARR 2024 June Submission 2692 Authors

15 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Distantly Supervised Named Entity Recognition (DS-NER) uses knowledge bases or dictionaries to annotate training data, which reduces manual effort but introduces false positives and false negatives into the training labels. In this paper, we re-examine existing DS-NER methods in real-world scenarios and find that many of them rely on large validation sets and that some inappropriately use the test set for tuning. We introduce a new dataset, QTL, in which the training data is annotated using domain dictionaries and the test data is annotated by domain experts. The dataset has a small validation set, reflecting real-life scenarios. We also propose a new approach, token-level Curriculum-based Positive-Unlabeled Learning (CuPUL), which uses curriculum learning to order training samples from easy to hard. This ordering stabilizes training, making the method robust and effective with small validation sets. CuPUL also addresses false negatives through the Positive-Unlabeled learning paradigm, and it demonstrates improved performance in real-life applications.
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: named entity recognition and relation extraction
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 2692