Abstract: In this paper, we explore whether synthetic datasets generated by large language models are useful for low-resource named entity recognition, considering 11 languages from diverse language families. Our results suggest that synthetic data created with seed human-labeled data is a reasonable choice when no labeled data is available, and outperforms automatically labeled data. However, a small amount of high-quality data, coupled with cross-lingual transfer from a related language, always offers better performance.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, multilingual evaluation, less-resourced languages, resources for less-resourced languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Tamil, Kannada, Malayalam, Telugu, Kinyarwanda, Swahili, Igbo, Yoruba, Swedish, Danish, and Slovak
Submission Number: 480