Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

ACL ARR 2024 December Submission 480 Authors

14 Dec 2024 (modified: 19 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: In this paper, we explore whether synthetic datasets generated by large language models are useful for low-resource named entity recognition, considering 11 languages from diverse language families. Our results suggest that synthetic data created from seed human-labeled data is a reasonable choice when no labeled data is available, and is better than using automatically labeled data. However, a small amount of high-quality data, coupled with cross-lingual transfer from a related language, always offers better performance.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, multilingual evaluation, less-resourced languages, resources for less-resourced languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Tamil, Kannada, Malayalam, Telugu, Kinyarwanda, Swahili, Igbo, Yoruba, Swedish, Danish and Slovak
Submission Number: 480