Improving LLM Pretraining by Filtering Out Advertisements

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Large language model (LLM) performance increasingly depends not only on the size but also on the quality of internet-derived pretraining datasets. While LLM data selection methods have evolved, their evaluations often rely on aggregate metrics that fail to capture effects on individual downstream tasks. Motivated by this gap, our study finds that selecting pretraining data based on loss metrics can result in poor performance on knowledge-intensive benchmarks such as MMLU. To address this, we focus on filtering out low-information content, specifically advertisements, and build an effective ad classifier for this purpose. Moreover, the most straightforward way to assess pretraining data quality is to train a full-scale LLM, but this is prohibitively expensive and impractical for large-scale comparative studies. To overcome this, we use a smaller, 100M-parameter LLM as a proxy to predict the downstream performance of larger models, and we demonstrate a strong correlation between the small model's proxy indicators and the downstream task metrics of the large SFT model. This small-model evaluation technique not only greatly shortens the iteration cycle for refining data selection strategies but also reduces cost by 92.7%. Finally, our findings suggest that eliminating advertisement content improves not only performance on knowledge-intensive benchmarks but also results across various other capability dimensions.
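The ad-filtering step described in the abstract can be illustrated with a small sketch. The snippet below is a minimal, hypothetical stand-in, not the paper's classifier: it trains a TF-IDF plus logistic-regression ad classifier on a handful of made-up labeled examples and drops documents whose predicted ad probability exceeds an assumed 0.5 threshold.

```python
# Minimal sketch (not the authors' implementation): train a simple binary ad
# classifier, then remove documents predicted to be advertisements from a
# pretraining corpus. The training data, the TF-IDF + logistic-regression model
# choice, and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = advertisement, 0 = informative content.
train_texts = [
    "Buy now and save 50% on our premium subscription, limited time offer!",
    "Free shipping on all orders over $25. Shop the summer sale today.",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "The French Revolution began in 1789 and reshaped European politics.",
]
train_labels = [1, 1, 0, 0]

ad_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
ad_classifier.fit(train_texts, train_labels)

def filter_ads(documents, threshold=0.5):
    """Keep only documents whose predicted ad probability is below the threshold."""
    probs = ad_classifier.predict_proba(documents)[:, 1]
    return [doc for doc, p in zip(documents, probs) if p < threshold]

corpus = [
    "Order today and get a second bottle absolutely free!",
    "Gradient descent iteratively updates parameters to minimize a loss function.",
]
print(filter_ads(corpus))  # expected to keep only the informative document
```

In practice, the same filtering interface would be driven by whatever ad classifier the paper trains, applied over the full pretraining corpus before model training.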
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English