Information-theoretic Vocabularization via Optimal Transport for Machine Translation

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Withdrawn Submission | Readers: Everyone
Keywords: Vocabulary Construction, NLP
Abstract: It is well accepted that the choice of token vocabulary strongly affects the performance of machine translation. One dominant approach to constructing a good vocabulary is Byte Pair Encoding (BPE). However, because trial training is expensive, most previous studies try only a few commonly used vocabulary sizes. This paper finds a relation between an information-theoretic feature of a given vocabulary and the BLEU score obtained with it. Based on this observation, we formulate vocabularization, the task of finding the best token dictionary with a proper size, as an optimal transport problem. We then propose Info-VOT, a simple and efficient solution that avoids full and costly trial training. We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. Empirical results show that Info-VOT generates well-performing vocabularies in diverse scenarios. A further advantage of the approach is its low computational cost: on TED bilingual translation, Info-VOT spends only a few CPU hours generating vocabularies, while the traditional BPE-Search solution takes hundreds of GPU hours.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip