The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only

Guilherme Penedo; Quentin Malartic; Daniel Hesslow; Ruxandra Cojocaru; Hamza Alobeidli; Alessandro Cappelli; Baptiste Pannier; Ebtesam Almazrouei; Julien Launay

Published: 26 Sept 2023, Last Modified: 02 Nov 2023, NeurIPS 2023 Datasets and Benchmarks Poster

Keywords: web data, crawl, curated, deduplication, NLP, LLM

TL;DR: Adequately filtered and deduplicated web data alone can train models outperforming others trained on curated corpora such as The Pile

Abstract: Large language models are commonly trained on a mixture of filtered web data and curated "high-quality" corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 500 billion tokens from our RefinedWeb dataset, and language models with 1.3B and 7.5B parameters trained on it.

Supplementary Material: pdf

Submission Number: 473
