Large Language Models for Low-Resource Languages: A Plan for Te Reo Māori

Published: 05 Nov 2025, Last Modified: 05 Nov 2025
Venue: NLDL 2026 Abstracts
License: CC BY 4.0
Keywords: Large Language Models, LLM, Low-resource languages, LRL, CARE, data sovereignty
TL;DR: We plan to create the first sovereign Māori language model through dataset creation, model fine-tuning, and benchmark development.
Abstract: Large Language Models perform remarkably well on high-resource languages but lag behind on low-resource and Indigenous languages. This gap has prompted several language communities to create specialized fine-tuned models for their own languages. This extended abstract presents an early-stage plan to develop the first sovereign Māori large language model. The plan includes curating high-quality Māori text datasets, constructing culturally relevant benchmarks, and performing continual pre-training and instruction-tuning of open-weight foundation models. The work will be carried out under Māori expert oversight, with community participation from Māori language speakers and iwi (tribes), and in accordance with the CARE principles of Collective Benefit, Authority to Control, Responsibility, and Ethics. At this stage, corpus creation, model choice, and evaluation methods remain under exploration.
Serve As Reviewer: ~David_Samuel1, ~Julen_Etxaniz1
Submission Number: 26