Segmenting Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Abstract: In Machine Translation (MT) research, recipes often recommend a fixed set of hyperparameters for training word segmentation models, regardless of the amount of text or the language pair involved. Although fixed hyperparameters for the word segmentation model reduce training resource overhead, we find that using the same number of merge operations (NMO) on both the source and target languages ($\textit{symmetric}$ Byte Pair Encoding (BPE)) across different language pairs and text sizes does not guarantee optimal MT system performance. In this work, we explore and identify BPE segmentation recipes across various data sizes and language pairs that yield optimal performance. We find that $\textit{asymmetric}$ BPE improves results over symmetric BPE, particularly in low-resource scenarios (50K, 100K, 500K), by 5.32, 4.46, and 0.7 CHRF++ points, respectively (p~$<$~0.05), on average for English-Hindi. We further validate our findings on six additional pairs, English$\leftrightarrow${Telugu, Shona, Norwegian, Kyrgyz, Hausa, Inuktitut}, to show the consistency of these results. A statistically significant improvement is observed with asymmetric BPE configurations in 10 of 12 systems when compared with symmetric BPE configurations. Our findings indicate that a high NMO for the source ($\textit{4K}$ to $\textit{32K}$) and a low NMO for the target ($\textit{0.5K}$ to $\textit{2K}$) provide optimal results, particularly in low-resource contexts.
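The asymmetric recipe described above amounts to learning two independent BPE merge tables, one per side of the parallel corpus, with different NMO values. Below is a minimal sketch of this setup using the subword-nmt toolkit; the toolkit choice, the file names (train.en, train.hi, bpe.codes.*), and the specific NMO values are illustrative assumptions, not the paper's prescribed pipeline.

```python
# Minimal sketch: asymmetric BPE with subword-nmt (pip install subword-nmt).
# Assumes plain-text parallel files train.en (source) and train.hi (target).
import codecs

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Asymmetric setting from the abstract: high NMO on the source side,
# low NMO on the target side. Exact values here are hypothetical picks
# from the reported ranges (source 4K-32K, target 0.5K-2K).
SRC_NMO = 8000
TGT_NMO = 1000

# Learn a separate merge table for each side of the corpus.
with codecs.open("train.en", encoding="utf-8") as fin, \
     codecs.open("bpe.codes.en", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, SRC_NMO)

with codecs.open("train.hi", encoding="utf-8") as fin, \
     codecs.open("bpe.codes.hi", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, TGT_NMO)

# Apply each model to its own side before NMT training.
with codecs.open("bpe.codes.en", encoding="utf-8") as codes:
    src_bpe = BPE(codes)
print(src_bpe.process_line("Segmenting beyond defaults improves translation."))
```

A symmetric baseline would simply reuse one NMO (or one jointly learned merge table) for both sides; the sketch differs only in passing a distinct NMO to each `learn_bpe` call.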
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine translation, low-resource settings, subword tokenisation, Byte Pair Encoding, asymmetric Byte Pair Encoding
Contribution Types: Approaches to low-resource settings
Languages Studied: English, Hindi, Telugu, Shona, Norwegian, Kyrgyz, Hausa, Inuktitut
Submission Number: 2857