Phonetic to Morphemic Text Conversion of a Very Low Resource Language Using the ByT5-Small Transformer Model
Abstract: The Ainu language, indigenous to the Japanese island of Hokkaido and its surrounding areas, is a critically endangered, low-resource language. It is transcribed in the Japanese katakana script, the Latin alphabet, or both. Transliteration from one script to the other is non-trivial, and conversion from phonetic to morphemic transcription is particularly difficult with a rule-based approach. We fine-tuned the ByT5-small transformer model on a dataset of Saru Ainu texts to automate the transliteration process. Our experiments evaluate multiple hyperparameter configurations and show that the model can accurately convert Ainu text from katakana to the Latin alphabet. Our best model achieved a CharacTER score of 0.0067 (lower is better), indicating high transliteration accuracy and providing a valuable tool for Ainu language preservation. More broadly, this work offers a practical method for expanding the corpora of low-resource languages that have been recorded in several different orthographies.
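To illustrate why a byte-level model such as ByT5 can handle katakana and Latin text with a single small vocabulary, the following minimal sketch mimics how ByT5-style tokenization maps text to IDs. It assumes the usual ByT5 convention (UTF-8 byte value plus an offset of 3, with IDs 0 to 2 reserved for padding, end-of-sequence, and unknown). The function names and the example word pair are illustrative only and are not taken from the paper's code or dataset.

```python
# Illustrative sketch of ByT5-style byte-level encoding (not the authors' code).
# Assumption: token ID = UTF-8 byte value + 3; IDs 0, 1, 2 are pad, EOS, unk.

BYTE_OFFSET = 3   # assumed number of reserved special-token IDs
EOS_ID = 1        # assumed end-of-sequence ID


def byte_level_ids(text: str) -> list[int]:
    """Encode text as shifted UTF-8 byte values followed by an EOS ID."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_byte_level_ids(ids: list[int]) -> str:
    """Invert byte_level_ids, skipping special IDs outside the byte range."""
    payload = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return payload.decode("utf-8", errors="ignore")


# Hypothetical source/target pair: the word "Ainu" in katakana and in Latin script.
katakana_ids = byte_level_ids("アイヌ")
latin_ids = byte_level_ids("aynu")
```

Each katakana character occupies three UTF-8 bytes, so the katakana input sequence is noticeably longer than its Latin counterpart, yet both scripts share the same 256 byte values and no script-specific vocabulary is needed.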