SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation
Abstract: Subword regularized models leverage multiple subword tokenizations of one target sentence during training. However, selecting only one tokenization during inference underutilizes the knowledge learned about multiple tokenizations. We propose the SubMerge algorithm to rescue the ignored subword tokenizations by merging equivalent ones during inference. SubMerge is a nested search algorithm in which the outer beam search treats the word as the minimal unit, and the inner beam search provides a list of word candidates and their probabilities, merging equivalent subword tokenizations. SubMerge estimates the probability of the next word more precisely, providing better guidance during inference. Experimental results on six low-resource to high-resource machine translation datasets show that SubMerge utilizes a greater proportion of a model's probability weight during decoding (lower word perplexities for hypotheses). It also improves BLEU and chrF++ scores for many translation directions, most reliably in low-resource scenarios. We also investigate the effect of different beam sizes, training set sizes, and dropout rates, and whether SubMerge is effective on non-regularized models.
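The abstract describes a nested search in which an inner, subword-level beam search produces word candidates whose equivalent subword tokenizations are merged. The following minimal Python sketch illustrates that merging idea under stated assumptions: `model.next_token_probs` and `tokenizer.is_word_boundary` are hypothetical interfaces introduced here for illustration, not the paper's actual implementation.

```python
import heapq
import math
from collections import defaultdict

def inner_word_search(model, prefix_ids, tokenizer, beam_size=4, max_subwords=8):
    """Sketch of the inner search: return word candidates with merged probabilities.

    Each beam item is (log-probability, subword ids of a partial word). Items whose
    subword ids detokenize to the same word string are merged by summing their
    probabilities, so the outer word-level beam sees one candidate per word.
    """
    beams = [(0.0, [])]            # partial-word hypotheses
    finished = defaultdict(float)  # word string -> summed probability

    for _ in range(max_subwords):
        candidates = []
        for logp, ids in beams:
            # assumed model API: distribution over the next subword token
            probs = model.next_token_probs(prefix_ids + ids)
            for tok_id, p in probs.items():
                candidates.append((logp + math.log(p), ids + [tok_id]))
        # keep only the best expansions
        candidates = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

        beams = []
        for logp, ids in candidates:
            if tokenizer.is_word_boundary(ids):   # assumed helper: a full word was formed
                word = tokenizer.decode(ids)
                finished[word] += math.exp(logp)  # merge equivalent subword tokenizations
            else:
                beams.append((logp, ids))
        if not beams:
            break

    # word candidates and their merged probabilities for the outer beam search
    return sorted(finished.items(), key=lambda kv: -kv[1])[:beam_size]
```

In this sketch the outer beam search would extend each word-level hypothesis with the candidates returned here, scoring with the merged word probabilities rather than the probability of a single subword path.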