* C4: https://huggingface.co/datasets/allenai/c4
* FineWeb: https://huggingface.co/datasets/HuggingFaceFW/fineweb
* RefinedWeb: https://huggingface.co/datasets/tiiuae/falcon-refinedweb
* Dolma: https://huggingface.co/datasets/allenai/dolma
* RedPajama-V2: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
* DCLM-Baseline: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
* FineWeb-Edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

* RedPajama-1T (sample): https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
* SlimPajama-627B-DC: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC
