Keywords: NLP for low-resource languages
Abstract: Utilizing NLP to assist data annotation remains a challenge for low-resource languages. This study shows that the optimal minimal training size for morphological annotation is about 5,000 labeled tokens; beyond that point, further annotation yields diminishing returns in model performance. We assess performance improvement in relation to annotated dataset size for a neural Transformer model, a pre-trained Llama 3 model, and a non-neural CRF model, all used within an active learning loop. We also analyze corpus diversity via type-token ratio (TTR) to better understand how sample diversity affects model improvement during active learning, and find that raw TTR scores indicate the point of peak performance.
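The diversity measure named in the abstract is straightforward to compute; the sketch below is a minimal Python illustration of raw TTR, assuming whitespace tokenization (the paper's actual tokenization scheme is not specified in this abstract).

```python
# Minimal sketch of raw type-token ratio (TTR), a corpus-diversity measure.
# Assumption: whitespace tokenization; the paper may tokenize differently.

def type_token_ratio(tokens: list[str]) -> float:
    """Number of unique types divided by total token count."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "the cat saw the dog and the dog saw the cat".split()
print(type_token_ratio(sample))  # 5 types / 11 tokens ≈ 0.455
```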
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: data-efficient training, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings
Languages Studied: Alas-Kluet, Bonggi, Choctaw, Lezgi, Natügu, Upper Tanana
Submission Number: 7072