Keywords: NLP for low-resource languages
Abstract: Utilizing NLP to assist data annotation remains a challenge for low-resource languages. This study shows that the optimal minimal training size for morphological annotation is about 5,000 labeled tokens; beyond that point, further annotation yields diminishing returns in model performance. We assess performance improvement in relation to annotated dataset size for a neural Transformer model, a pre-trained Llama 3 model, and a non-neural CRF model, all used within an active learning loop. We also analyze corpus diversity via type-token ratio (TTR) to better understand how sample diversity affects model improvement during active learning, and find that raw TTR scores indicate the point of peak performance.
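The diversity measure named in the abstract is straightforward to compute; the sketch below is a minimal Python illustration of raw TTR, assuming whitespace tokenization (the paper's actual tokenization scheme is not specified in this abstract).

```python
# Minimal sketch of raw type-token ratio (TTR), a corpus-diversity measure.
# Assumption: whitespace tokenization; the paper may tokenize differently.

def type_token_ratio(tokens: list[str]) -> float:
    """Number of unique types divided by total token count."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "the cat saw the dog and the dog saw the cat".split()
print(type_token_ratio(sample))  # 5 types / 11 tokens ≈ 0.455
```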
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: data-efficient training, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings
Languages Studied: Alas-Kluet, Bonggi, Choctaw, Lezgi, Natügu, Upper Tanana
Submission Number: 7072