Abstract: Filtering data, especially data scraped from the Internet, has long been known to improve model performance. Recently, it has been shown that an effective filter can be built by using large language models (LLMs) to generate synthetic labels, which are then used to train a smaller neural model. However, this approach has mainly been tested on English. Our paper extends it to languages beyond English, including languages not officially supported by LLMs. We validate our results on the downstream task of neural machine translation (NMT) and demonstrate that our approach is effective both at filtering parallel text for translation quality and at filtering for domain specificity. Additionally, we find that a classification objective is more effective and robust than a regression objective at low data thresholds when training our filtering models.
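The abstract's recipe, distilling LLM quality judgments into a small filtering model trained with a classification objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the logistic-regression filter, the `|||`-joined sentence pairs, and the toy labels below are hypothetical stand-ins for the paper's smaller neural model and its LLM-generated labels.

```python
# Minimal sketch: train a small filtering classifier on synthetic
# binary quality labels assigned by an LLM (classification objective).
# All data, labels, and the model choice here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical source|||target sentence pairs with binary LLM labels
# (1 = high-quality translation pair, 0 = noisy or misaligned).
pairs = [
    "The cat sat. ||| Die Katze saß.",
    "Click here now!!! ||| Achetez maintenant!!!",
]
llm_labels = [1, 0]

# A small classifier stands in for the "smaller neural model";
# the same recipe applies to a fine-tuned encoder with a softmax head.
filter_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
filter_model.fit(pairs, llm_labels)

# Keep only candidate pairs the classifier scores above a threshold.
candidates = ["A clean pair. ||| Ein sauberes Paar."]
keep = [p for p in candidates
        if filter_model.predict_proba([p])[0, 1] > 0.5]
print(keep)
```

Under the abstract's finding, the discrete labels used here (classification) would be preferred over regressing on raw LLM quality scores when labeled data is scarce.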
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Machine Translation
Contribution Types: NLP engineering experiment
Languages Studied: English, German, Arabic, Romanian
Submission Number: 3176