Improving LLM Pretraining by Filtering Out Advertisements

ACL ARR 2024 June Submission5156 Authors

16 Jun 2024 (modified: 07 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Data has been recognized as a vital factor for Large Language Models (LLMs), prompting the development of various data selection methods to optimize pretraining data. Among these, loss-based filtering has gained popularity due to its simplicity. However, our empirical findings suggest that this approach can degrade performance on knowledge-intensive benchmarks such as MMLU. To address this issue, we propose filtering out low-information text, particularly advertisements, which constitute a significant portion of internet content. We employed a 100M-parameter proxy model to compare the two filtering methods. Despite its smaller size, the proxy model's results accurately predict the downstream metrics when scaled to 3B-parameter models. This study demonstrates that a 100M-parameter proxy model is sufficient for comparing different data selection strategies, and our experiments across various benchmarks confirm the effectiveness of eliminating advertisements from pretraining data.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Efficiency in Training
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 5156