Filtering Matters: Experiments in Filtering Training Sets for Machine Translation

Published: 20 Mar 2023, Last Modified: 16 Apr 2023 | NoDaLiDa 2023 | Readers: Everyone
Keywords: MT, Filtering, Parallel Corpora
TL;DR: Experiments in Filtering Training Sets for Machine Translation
Abstract: We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and whether separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of the different approaches both manually and on a downstream NMT task. We find, first, that it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, the difference is nonetheless statistically significant. Finally, we find that the same training sets do not seem to suit different translation directions.
Student Paper: Yes, the first author is a student