Abstract: Today we are releasing SlimPajama – the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. SlimPajama was created by cleaning and deduplicating the 1.21T token RedPajama dataset from Together. By filtering out low-quality data and duplicates, we were able to remove 49.6% of bytes, slimming the dataset down from 1210B to 627B tokens. We believe SlimPajama offers the highest-quality and most compute-efficient data to train on for runs of up to 627B tokens. When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion-token scale.
In addition to the data, we are also releasing the tools we built to create SlimPajama. Applying MinHashLSH (Leskovec et al. 2014) deduplication to trillion-token datasets like RedPajama was not possible with off-the-shelf open-source code. We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion-token datasets in a distributed, multi-threaded, and memory-efficient fashion. Today we are open-sourcing this infrastructure to enable the community to easily create higher-quality, extensively deduplicated datasets in the future.
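To make the underlying technique concrete, here is a minimal single-machine sketch of MinHashLSH near-duplicate removal using the open-source `datasketch` library. This is not the distributed pipeline released with SlimPajama, and the shingle size and similarity threshold below are illustrative placeholders, not our production settings:

```python
# Minimal single-node illustration of MinHashLSH deduplication with the
# open-source `datasketch` library. This only demonstrates the technique;
# it is not the distributed, memory-efficient pipeline described above.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of hash permutations per signature (illustrative)
THRESHOLD = 0.8   # approximate Jaccard similarity cutoff (illustrative)

def minhash_signature(text: str, ngram: int = 13) -> MinHash:
    """Build a MinHash signature over lowercased word shingles."""
    m = MinHash(num_perm=NUM_PERM)
    words = text.lower().split()
    for i in range(max(len(words) - ngram + 1, 1)):
        m.update(" ".join(words[i:i + ngram]).encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the keys of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for key, text in docs.items():
        sig = minhash_signature(text)
        if not lsh.query(sig):   # no sufficiently similar document seen yet
            lsh.insert(key, sig)
            kept.append(key)
    return kept

if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog " * 3,
        "b": "the quick brown fox jumps over the lazy dog " * 3,  # near-dup of "a"
        "c": "an entirely different document about language models",
    }
    print(deduplicate(corpus))  # expect ["a", "c"]
```

Scaling this idea to a trillion tokens is exactly where naive implementations break down: signatures for billions of documents do not fit in one process's memory, which motivates the distributed, multi-threaded design of the released tools.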
Our contributions are as follows:
SlimPajama 627B – the largest extensively deduplicated, multi-corpora, open dataset for LLM training. We release it under the Apache 2.0 license at https://huggingface.co/datasets/cerebras/SlimPajama-627B
Validation and test sets, 500M tokens each, which have been decontaminated against the training data (a sketch of one common decontamination approach appears after this list)
A library of methods to replicate SlimPajama or pre-process other datasets from scratch. To the best of our knowledge, these are the first open-source tools to enable cleaning and MinHashLSH deduplication of text data at trillion-token scale.
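For the held-out sets mentioned above, one common decontamination approach is to drop any validation or test example that shares a long n-gram with a training document. The sketch below illustrates that idea; the n-gram length is a hypothetical placeholder and the exact procedure used for SlimPajama's sets may differ:

```python
# Hedged illustration of n-gram-based decontamination: keep only held-out
# examples with no long n-gram overlap against the training corpus.
# NGRAM = 10 is an assumed value chosen for illustration.
from typing import Iterable

NGRAM = 10  # long window to avoid flagging incidental short overlaps

def ngrams(text: str, n: int = NGRAM) -> set[str]:
    """Lowercased word n-grams of a document."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(eval_docs: Iterable[str], train_docs: Iterable[str]) -> list[str]:
    """Keep only eval documents disjoint (in n-grams) from all training data."""
    train_grams: set[str] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc)
    return [doc for doc in eval_docs if ngrams(doc).isdisjoint(train_grams)]
```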