ENLIVEN-1000: A Comprehensive Revitalization Framework for 1000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT
Keywords: endangered languages, low-resource NLP, language revitalization, language identification (LID), machine translation (MT), endangered-to-English translation, multilingual MT, synthetic data augmentation, synthetic parallel data, GPT-4o few-shot prompting, NLLB-200-600M fine-tuning, open corpora, fastText, ChrF, COMET, data scaling, typological diversity, community-centered NLP, open-source release
TL;DR: ENLIVEN-1000 is a complete, open framework delivering 1154-language LID and endangered-to-English MT, with an Inuktitut scaling study and GPT-4o synthetic augmentation showing language- and scale-dependent benefits.
Abstract: We present **ENLIVEN-1000**, a unified framework for endangered and low-resource language revitalization that integrates broad-coverage language identification (LID), machine translation (MT), and LLM-generated synthetic data, with the aim of expanding safe, equitable NLP support for communities historically excluded from mainstream tools. We compile a text corpus for 1154 languages (1069 of them endangered or low-resource) from public sources and train a fastText-based LID model covering all of them. The LID system achieves F1 $\approx 0.99$ and a false positive rate (FPR) of $\approx 3 \times 10^{-6}$, substantially broadening reliable coverage beyond existing solutions. Focusing on five diverse endangered languages (Carpathian Romani, Chuj, Sunwar, Kapingamarangi, and Inuktitut), we fine-tune a 600M-parameter NLLB-200 model for translation. Our fine-tuned models outperform zero-shot baselines, and even proxy models trained on related high-resource languages, in both directions (endangered$\to$English and English$\to$endangered). We further use GPT-4o to generate synthetic parallel data, demonstrating that augmenting limited real data with LLM-generated text yields substantial MT improvements. These results illustrate a practical path toward scaling NLP support to hundreds of under-resourced languages. We discuss implications for language revitalization and ethical considerations in working with endangered language communities.
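To make the two LID figures in the abstract concrete, the following is a minimal sketch of how per-language F1 and false positive rate can be computed from gold and predicted labels. The abstract does not say how the scores are aggregated, so the macro-averaging over languages used here is an assumption, and the function name `lid_f1_and_fpr` and the three-letter labels are purely illustrative.

```python
from collections import Counter


def lid_f1_and_fpr(gold, pred):
    """Return (macro F1, macro FPR) for single-label language identification.

    Assumption: both scores are averaged uniformly over languages; the
    source does not state the aggregation it uses.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted language gains a false positive
            fn[g] += 1  # the gold language gains a false negative

    total = len(gold)
    f1_scores, fpr_scores = [], []
    for lang in labels:
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        f1_scores.append(2 * tp[lang] / denom if denom else 0.0)
        # Negatives for a language are all sentences written in another language.
        negatives = total - (tp[lang] + fn[lang])
        fpr_scores.append(fp[lang] / negatives if negatives else 0.0)

    return sum(f1_scores) / len(labels), sum(fpr_scores) / len(labels)


# Toy example with illustrative labels (not data from the paper).
gold = ["iku", "iku", "cac", "suz"]
pred = ["iku", "cac", "cac", "suz"]
macro_f1, macro_fpr = lid_f1_and_fpr(gold, pred)
print(f"macro F1 = {macro_f1:.3f}, macro FPR = {macro_fpr:.3f}")
```

With more than a thousand candidate labels, almost every sentence is a negative for any given language, which is why a usable LID system needs a false positive rate many orders of magnitude below 1.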
Submission Number: 30