Improving LLM Pretraining by Filtering Out Advertisements

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Large language model (LLM) performance increasingly depends not only on the size but also on the quality of internet-derived pretraining datasets. While LLM data selection methods have evolved, their evaluations often rely on aggregate metrics that fail to capture effects on individual downstream tasks. Motivated by this gap, our study finds that selecting pretraining data based on loss metrics can result in poor performance on knowledge-intensive benchmarks such as MMLU. To address this, we focus on filtering out low-information content, specifically advertisements, and build an effective ad classifier for this purpose. Moreover, the most straightforward way to assess pretraining data quality is to train a full-scale LLM, but this is prohibitively expensive and impractical for large-scale comparative studies. To overcome this, we use a smaller, 100M-parameter LLM as a proxy to predict the downstream performance of larger models, and we demonstrate a strong correlation between the small model's proxy indicators and the downstream task metrics of the large SFT model. This small-model evaluation technique not only greatly shortens the iteration cycle for refining data selection strategies but also reduces cost by 92.7%. Finally, our findings suggest that eliminating advertisement content improves not only performance on knowledge-intensive benchmarks but also results across various other capability dimensions.
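The ad-filtering step described in the abstract can be illustrated with a small sketch. The snippet below is a minimal, hypothetical stand-in, not the paper's classifier: it trains a TF-IDF plus logistic-regression ad classifier on a handful of made-up labeled examples and drops documents whose predicted ad probability exceeds an assumed 0.5 threshold.

```python
# Minimal sketch (not the authors' implementation): train a simple binary ad
# classifier, then remove documents predicted to be advertisements from a
# pretraining corpus. The training data, the TF-IDF + logistic-regression model
# choice, and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = advertisement, 0 = informative content.
train_texts = [
    "Buy now and save 50% on our premium subscription, limited time offer!",
    "Free shipping on all orders over $25. Shop the summer sale today.",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "The French Revolution began in 1789 and reshaped European politics.",
]
train_labels = [1, 1, 0, 0]

ad_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
ad_classifier.fit(train_texts, train_labels)

def filter_ads(documents, threshold=0.5):
    """Keep only documents whose predicted ad probability is below the threshold."""
    probs = ad_classifier.predict_proba(documents)[:, 1]
    return [doc for doc, p in zip(documents, probs) if p < threshold]

corpus = [
    "Order today and get a second bottle absolutely free!",
    "Gradient descent iteratively updates parameters to minimize a loss function.",
]
print(filter_ads(corpus))  # expected to keep only the informative document
```

In practice, the same filtering interface would be driven by whatever ad classifier the paper trains, applied over the full pretraining corpus before model training.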
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English