Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task

22 Jun 2021, OpenReview Archive Direct Upload, Readers: Everyone
Abstract: We present our semantic textual similarity approach to filtering a noisy web-crawled parallel corpus using YiSi, a novel semantic machine translation evaluation metric. The systems based mainly on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in the 100-million-word evaluation, 8th place in the 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best-performing system, NRC-yisi-bicov, is one of only four submissions ranked in the top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt at automatically synthesizing a noisy parallel development corpus for tuning the weights to combine different parallelism and fluency features.