Multilingual Data Filtering using Synthetic Data from Large Language Models

ACL ARR 2025 February Submission3176 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Filtering data, especially data scraped from the Internet, has long been known to improve model performance. Recently, it has been shown that an effective filter can be created by using large language models (LLMs) to produce synthetic labels, which are then used to train a smaller neural model. However, this approach has mainly been tested in English. Our paper extends this approach to languages beyond English, including languages not officially supported by the LLMs. We validate our results on the downstream task of neural machine translation (NMT) and demonstrate that our approach is effective both at filtering parallel text for translation quality and at filtering for domain specificity. Additionally, we find that a classification objective is more performant and robust than a regression objective at low data thresholds when training our filtering models.
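The pipeline the abstract describes (LLM-assigned synthetic quality labels used to train a small filtering classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy sentence pairs, the 0/1 labels, the `|||` pair format, and the choice of a TF-IDF + logistic regression classifier are all assumptions made here for concreteness.

```python
# Hypothetical sketch of the filtering setup: a small classifier is trained
# on quality labels that an LLM is assumed to have assigned to parallel text.
# The data, labels, and model choice below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy source ||| target pairs with synthetic keep/discard labels
# (in the paper's setting, an LLM would produce these labels).
pairs = [
    "the cat sat on the mat ||| die Katze sass auf der Matte",
    "click here to win money ||| Haus Baum Auto zufaellig",
    "good morning everyone ||| guten Morgen zusammen",
    "asdf qwer zxcv ||| lorem ipsum dolor",
]
labels = [1, 0, 1, 0]  # 1 = keep (plausible translation), 0 = filter out

# Classification objective: predict keep/discard directly, rather than
# regressing a continuous quality score.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(pairs, labels)

# Score new pairs and keep those above a probability threshold.
scores = clf.predict_proba(pairs)[:, 1]
kept = [p for p, s in zip(pairs, scores) if s >= 0.5]
```

Character n-gram features are used here only because they need no language-specific tokenization, which loosely mirrors the multilingual setting; the paper itself trains a neural filtering model.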
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Machine Translation
Contribution Types: NLP engineering experiment
Languages Studied: English, German, Arabic, Romanian
Submission Number: 3176