$\mu$\textsc{gnaw}$\oplus$(0,1,2): three context-free micro-grammars oriented toward machine learning of the Nawatl language

ACL ARR 2025 July Submission 300 Authors

27 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: In this article we present three context-free micro-grammars (CFMGs) for the Nawatl language. Nawatl, an Amerindian language, is a $\pi$-language, i.e., a language with very few digital resources, for which corpora suitable for Large Language Models (LLMs) are virtually nonexistent. Our objective is to generate a large number of Nawatl sentences in order to expand the corpus available for training static embeddings or LLMs. Using the best of the three micro-grammars, we significantly expanded the Nawatl corpus $\pi$-yalli; we then used this enriched corpus to train FastText embeddings and applied them to a sentence-level semantic task. The results show an encouraging improvement over those obtained with the original corpus alone, without artificial expansion.
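
To make the pipeline concrete, here is a minimal sketch, assuming Python with NLTK and Gensim, of how sentences can be generated from a context-free grammar and used to train FastText static embeddings. The toy rules and vocabulary below are illustrative placeholders, not the paper's actual micro-grammars or lexicon.

```python
from nltk import CFG
from nltk.parse.generate import generate
from gensim.models import FastText

# Toy context-free micro-grammar (hypothetical rules and vocabulary,
# for illustration only; the paper's CFMGs are not reproduced here).
grammar = CFG.fromstring("""
    S  -> NP VP
    NP -> N
    VP -> V NP
    N  -> 'siwatl' | 'tlakatl' | 'kalli' | 'atl'
    V  -> 'kitta' | 'kineki'
""")

# Exhaustively generate synthetic sentences (as token lists) up to a
# bounded derivation depth, to augment the training corpus.
synthetic = [tokens for tokens in generate(grammar, depth=5)]

# Train FastText static embeddings on the (toy) augmented corpus.
model = FastText(sentences=synthetic, vector_size=100, window=3,
                 min_count=1, epochs=10)
print(model.wv.most_similar('siwatl', topn=3))
```

In practice the synthetic sentences would be concatenated with the original $\pi$-yalli corpus before training, and the resulting embeddings evaluated on the downstream sentence-level semantic task.
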
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Nawatl language, Data augmentation, LLM efficiency, NLP in resource-constrained settings, Static embeddings
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Nahuatl (Amerindian Language)
Submission Number: 300