Multilingual Data Filtering using Synthetic Data from Large Language Models

ACL ARR 2025 February Submission3176 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Filtering data, especially data scraped from the Internet, has long been known to improve model performance. Recently, it has been shown that an effective filter can be created by using large language models (LLMs) to produce synthetic labels, which are then used to train a smaller neural model. However, this approach has mainly been tested in English. Our paper extends this approach to languages beyond English, including languages not officially supported by the LLMs. We validate our results on the downstream task of neural machine translation (NMT) and demonstrate that our approach is effective both at filtering parallel text for translation quality and at filtering for domain specificity. Additionally, we find that a classification objective is more performant and robust than a regression objective at low data thresholds when training our filtering models.
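The pipeline the abstract describes (LLM-assigned synthetic quality labels used to train a small filtering classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy sentence pairs, the 0/1 labels, the `|||` pair format, and the choice of a TF-IDF + logistic regression classifier are all assumptions made here for concreteness.

```python
# Hypothetical sketch of the filtering setup: a small classifier is trained
# on quality labels that an LLM is assumed to have assigned to parallel text.
# The data, labels, and model choice below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy source ||| target pairs with synthetic keep/discard labels
# (in the paper's setting, an LLM would produce these labels).
pairs = [
    "the cat sat on the mat ||| die Katze sass auf der Matte",
    "click here to win money ||| Haus Baum Auto zufaellig",
    "good morning everyone ||| guten Morgen zusammen",
    "asdf qwer zxcv ||| lorem ipsum dolor",
]
labels = [1, 0, 1, 0]  # 1 = keep (plausible translation), 0 = filter out

# Classification objective: predict keep/discard directly, rather than
# regressing a continuous quality score.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(pairs, labels)

# Score new pairs and keep those above a probability threshold.
scores = clf.predict_proba(pairs)[:, 1]
kept = [p for p, s in zip(pairs, scores) if s >= 0.5]
```

Character n-gram features are used here only because they need no language-specific tokenization, which loosely mirrors the multilingual setting; the paper itself trains a neural filtering model.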
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Machine Translation
Contribution Types: NLP engineering experiment
Languages Studied: English, German, Arabic, Romanian
Submission Number: 3176