Segmenting Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Abstract: In Machine Translation (MT) research, recipes often recommend a fixed set of hyperparameters for training word segmentation models, regardless of the amount of text or the language pair involved. Although fixed hyperparameters for the word segmentation model reduce training resource overhead, we find that using the same number of merge operations (NMO) on both the source and target languages ($\textit{symmetric}$ Byte Pair Encoding (BPE)) across different language pairs and text sizes does not guarantee optimal MT system performance. In this work, we explore and identify BPE segmentation recipes across various data sizes and language pairs that yield optimal performance. We find that $\textit{asymmetric}$ BPE improves results over symmetric BPE, particularly in low-resource scenarios (50K, 100K, 500K), by 5.32, 4.46, and 0.7 CHRF++ points, respectively (p~$<$~0.05), on average for English-Hindi. We further validate our findings on six additional pairs, English$\leftrightarrow${Telugu, Shona, Norwegian, Kyrgyz, Hausa, Inuktitut}, to show the consistency of these results. A statistically significant improvement is observed with asymmetric BPE configurations in 10 of 12 systems when compared with symmetric BPE configurations. Our findings indicate that a high NMO for the source ($\textit{4K}$ to $\textit{32K}$) and a low NMO for the target ($\textit{0.5K}$ to $\textit{2K}$) provide optimal results, particularly in low-resource contexts.
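The asymmetric recipe described above amounts to learning two independent BPE merge tables, one per side of the parallel corpus, with different NMO values. Below is a minimal sketch of this setup using the subword-nmt toolkit; the toolkit choice, the file names (train.en, train.hi, bpe.codes.*), and the specific NMO values are illustrative assumptions, not the paper's prescribed pipeline.

```python
# Minimal sketch: asymmetric BPE with subword-nmt (pip install subword-nmt).
# Assumes plain-text parallel files train.en (source) and train.hi (target).
import codecs

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Asymmetric setting from the abstract: high NMO on the source side,
# low NMO on the target side. Exact values here are hypothetical picks
# from the reported ranges (source 4K-32K, target 0.5K-2K).
SRC_NMO = 8000
TGT_NMO = 1000

# Learn a separate merge table for each side of the corpus.
with codecs.open("train.en", encoding="utf-8") as fin, \
     codecs.open("bpe.codes.en", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, SRC_NMO)

with codecs.open("train.hi", encoding="utf-8") as fin, \
     codecs.open("bpe.codes.hi", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, TGT_NMO)

# Apply each model to its own side before NMT training.
with codecs.open("bpe.codes.en", encoding="utf-8") as codes:
    src_bpe = BPE(codes)
print(src_bpe.process_line("Segmenting beyond defaults improves translation."))
```

A symmetric baseline would simply reuse one NMO (or one jointly learned merge table) for both sides; the sketch differs only in passing a distinct NMO to each `learn_bpe` call.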
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine translation, low-resource settings, subword tokenisation, Byte Pair Encoding, asymmetric Byte Pair Encoding
Contribution Types: Approaches to low-resource settings
Languages Studied: English, Hindi, Telugu, Shona, Norwegian, Kyrgyz, Hausa, Inuktitut
Submission Number: 2857