Abstract: Purépecha, an Indigenous Mexican language of great cultural richness, faces declining intergenerational transmission. To support its documentation, preservation, and digital visibility, the development of translation tools is essential. However, in the field of Neural Machine Translation (NMT), Purépecha is considered a low-resource language due to the scarcity of parallel corpora. This work addresses that challenge by investigating the effectiveness of synthetic data generated by Large Language Models (LLMs). Guided by grammatical rules and examples, we use an LLM to generate two synthetic Purépecha-Spanish corpora. To evaluate our method, we compare a transformer-based model (MarianMT) trained only on authentic data against a model pre-trained on our synthetic data and then fine-tuned on the same authentic data. The results demonstrate a substantial improvement in performance: while the baseline trained only on authentic data achieves a peak score of 28.60 BLEU, pre-training on our synthetic data raises the final score to 34.68 BLEU. This result establishes a new state of the art for Purépecha translation, significantly outperforming previously reported benchmarks. We conclude that synthetic data generation, enhanced by the ability of LLMs to understand grammatical context, is a viable and effective strategy for improving translation quality in low-resource scenarios.
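The abstract does not include training code; the following is a minimal sketch of the two-stage recipe it describes (pre-train MarianMT on LLM-generated synthetic pairs, then fine-tune on the authentic corpus), assuming the Hugging Face Transformers implementation of MarianMT. The starting checkpoint, dataset field names ("pua", "spa"), file paths, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset

# Hypothetical starting checkpoint: no public Purépecha MarianMT model exists,
# so a related Spanish checkpoint stands in here as a placeholder.
checkpoint = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

def preprocess(batch):
    # "pua" (Purépecha source) and "spa" (Spanish target) are assumed field names.
    inputs = tokenizer(batch["pua"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["spa"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

def train_stage(data_file, output_dir, epochs):
    # Each stage reuses the same `model` object, so stage 2 continues
    # from the weights learned in stage 1.
    ds = load_dataset("json", data_files=data_file)["train"].map(preprocess, batched=True)
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        save_strategy="epoch",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

# Stage 1: pre-train on the LLM-generated synthetic corpus.
train_stage("synthetic_pua_spa.jsonl", "ckpt-synthetic", epochs=3)
# Stage 2: fine-tune the same weights on the authentic parallel corpus.
train_stage("authentic_pua_spa.jsonl", "ckpt-finetuned", epochs=10)
```

Evaluation with BLEU, as reported in the abstract, could then be run on held-out authentic data (e.g., with the `sacrebleu` package), though the authors' exact evaluation setup is not specified here.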
DOI: 10.1007/978-3-032-09037-9_26