Byte-Level Neural Machine Translation for Manipuri and Tangkhul: Advancing Low-Resource Language Translation
Abstract: This paper introduces a machine translation system for Tangkhul and Manipuri, two low-resource Tibeto-Burman languages predominantly spoken by the indigenous tribal communities of Northeast India. The system is built on the ByT5-small model, whose tokenizer-free, byte-level architecture is well suited to the morphological complexity and script diversity of these languages. To address the scarcity of parallel data, we built a romanized parallel corpus and applied systematic preprocessing to ensure linguistic consistency. We evaluated the model with the automatic metrics BLEU, chrF2, and TER, and with human ratings of adequacy and fluency. The system achieved BLEU scores of 10.7 (Tangkhul→Manipuri) and 9.2 (Manipuri→Tangkhul), suggesting that byte-level models hold considerable potential for low-resource and underrepresented tribal languages. These results indicate that ByT5 is a useful basis for improving translation technologies for the many languages spoken in Northeast India and for making the region's native languages more accessible online.
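To make the "tokenizer-free, byte-level" idea concrete, the following minimal sketch shows how text can be turned into model input IDs without any learned vocabulary. It assumes the ID layout of the publicly released ByT5 vocabulary (IDs 0, 1, and 2 reserved for padding, end-of-sequence, and unknown, so that byte b maps to ID b + 3). The function names and sample strings are hypothetical illustrations and are not taken from the paper's code or corpus.

```python
# Illustrative sketch only: mimics the byte-to-ID mapping of the public ByT5
# vocabulary. This is an assumption about the layout, not the authors' code.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2  # reserved special-token IDs
OFFSET = 3                        # byte value b is mapped to ID b + OFFSET


def encode_bytes(text: str) -> list[int]:
    """Map a string to byte-level IDs and append the end-of-sequence ID."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping special tokens and out-of-range IDs."""
    raw = bytes(i - OFFSET for i in ids if OFFSET <= i < OFFSET + 256)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Romanized (ASCII) text costs one ID per character, plus EOS.
    print(encode_bytes("hello"))
    # A single Meetei Mayek code point (U+ABC3) costs three byte IDs, plus EOS.
    print(len(encode_bytes("\uabc3")))
    # The mapping is lossless for valid UTF-8 input.
    print(decode_bytes(encode_bytes("hello")))
```

Because every string is representable as bytes, no word is ever out of vocabulary, which is one reason a byte-level model can suit languages with rich morphology or more than one script. The trade-off is longer input sequences, particularly for non-Latin scripts.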
External IDs: doi:10.36227/techrxiv.176739469.90006038/v1