Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

Muhammad N. ElNokrashy, Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Y. Tawfik, Hany Hassan Awadalla

2020 (modified: 03 May 2024) WMT@EMNLP 2020 Readers: Everyone

Abstract: This paper describes our submission to the WMT20 sentence filtering task. We combine scores from a custom LASER model built for each source language, a classifier built to distinguish positive and negative pairs, and the original scores provided with the task. In the mBART setup provided by the organizers, our method yields relative improvements over the baseline of 7% and 5% in sacreBLEU score on the test set for Pashto and Khmer, respectively.
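
The abstract does not say how the three scores are combined, so the following is only a minimal sketch under stated assumptions: each score list is min-max normalized and the three are merged by a weighted sum, after which the highest-scoring sentence pairs are kept. The function names, the equal default weights, and the normalization step are hypothetical illustrations, not the authors' method.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(laser, classifier, original, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three normalized per-pair scores.

    ASSUMPTION: a linear combination with equal weights; the abstract
    does not specify the actual combination rule.
    """
    if not (len(laser) == len(classifier) == len(original)):
        raise ValueError("all score lists must have one entry per sentence pair")
    columns = [min_max(laser), min_max(classifier), min_max(original)]
    return [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(laser))
    ]


def top_pairs(combined, keep):
    """Return the indices of the `keep` highest-scoring sentence pairs."""
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:keep]


# Toy scores for three sentence pairs (made-up values, for illustration only).
laser = [0.2, 0.8, 0.5]
classifier = [0.1, 0.9, 0.4]
original = [1.0, 3.0, 2.0]

combined = combine_scores(laser, classifier, original)
selected = top_pairs(combined, keep=2)
```

In practice the weights would presumably be tuned on held-out data; equal weights are used here only to keep the sketch simple.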
