Effect of Tokenisation Strategies for Low-Resourced Southern African Languages

Published: 08 Apr 2022, Last Modified: 05 May 2023, Venue: AfricaNLP 2022, Readers: Everyone
Keywords: Low-Resourced, Tokenisation, BPE, Southern African Languages
TL;DR: Building on previous research into Neural Machine Translation (NMT) for low-resourced African languages, this work evaluates two byte pair encoding tokenisation algorithms (subword_nmt and SentencePiece) using an optimised transformer architecture.
Abstract: Research into machine translation for African languages is very limited and low-resourced in terms of datasets and model evaluations. This work aims to add to the field of neural machine translation research for four low-resourced Southern African languages. The effect of two byte pair encoding tokenisation algorithms (subword_nmt and SentencePiece), with various parameters, is evaluated. The paper builds upon previous research in the field for comparison, using an optimised transformer architecture and pre-cleaned data to translate English to Northern Sotho, Setswana, Xitsonga and isiZulu. The results show improvements over the previously reported BLEU scores for Setswana and isiZulu.