A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora

Eduard Barbu, Verginica Barbu Mititelu

Published: 2018, Last Modified: 26 May 2026. WMT (shared task) 2018. CC BY-SA 4.0

Abstract: A hybrid pipeline combining rules and machine learning is used to filter a noisy web-crawled English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a logistic regression module that returns the probability that a translation unit should be accepted. The training set for the logistic regression is created by automatic annotation, and the quality of this annotation is estimated by manually labeling the training set.
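The two-stage idea described in the abstract (rules first, then a logistic regression score) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the rule thresholds, the two features (character length ratio and number agreement), the toy training pairs with their labels, and the helper names `passes_rules`, `features`, `train`, and `accept_probability`. The logistic regression is a plain gradient-descent implementation, not the authors' system.

```python
import math

def passes_rules(src, tgt):
    """Hypothetical rule stage: drop empty, identical, or length-mismatched pairs."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt or src == tgt:
        return False
    return 0.5 <= len(src) / len(tgt) <= 2.0

def features(src, tgt):
    """Hypothetical features: bias, character length ratio, number agreement."""
    src, tgt = src.strip(), tgt.strip()
    ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    src_nums = {tok for tok in src.split() if tok.isdigit()}
    tgt_nums = {tok for tok in tgt.split() if tok.isdigit()}
    union = src_nums | tgt_nums
    agreement = len(src_nums & tgt_nums) / len(union) if union else 1.0
    return [1.0, ratio, agreement]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression weights with batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - label
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def accept_probability(w, src, tgt):
    """Return 0.0 for rule-rejected pairs, else the model's acceptance probability."""
    if not passes_rules(src, tgt):
        return 0.0
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(src, tgt))))

# Toy training set; the labels stand in for the automatic annotation
# (1 = accept, 0 = reject) and are invented for this sketch.
TRAIN = [
    ("The house is small .", "Das Haus ist klein .", 1),
    ("I have 2 cats .", "Ich habe 2 Katzen .", 1),
    ("Subscribe to our newsletter today", "Anmelden", 0),
    ("Room 12 is free .", "Zimmer 7 ist frei .", 0),
    ("Price : 30 euros", "Bitte kontaktieren Sie uns unter der folgenden Adresse", 0),
]

weights = train([features(s, t) for s, t, _ in TRAIN], [label for _, _, label in TRAIN])

if __name__ == "__main__":
    for s, t, _ in TRAIN:
        print(round(accept_probability(weights, s, t), 3), "|", s, "|||", t)
```

In a sketch like this, the rule stage cheaply discards obviously bad translation units, and the classifier only scores what remains. A real system would use a much richer feature set and a far larger automatically annotated training set.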