We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:

* `openwebtext_questions.txt` contains questions extracted from a subset of the OpenWebText Corpus downloaded [here](https://skylion007.github.io/OpenWebTextCorpus/).
* `wiki_long.txt` contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with [this script](https://github.com/princeton-nlp/SimCSE/blob/main/data/download_wiki.sh).
* `wiki_short.txt` contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with [this script](https://github.com/princeton-nlp/SimCSE/blob/main/data/download_wiki.sh).
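
The word-count ranges above can be illustrated with a minimal sketch. This is not the script used to build the released files: the helper name `filter_by_word_count`, the whitespace tokenization, and the inclusive bounds are all assumptions made for illustration.

```python
from typing import Iterable, List


def filter_by_word_count(
    sentences: Iterable[str], min_words: int, max_words: int
) -> List[str]:
    """Keep sentences whose whitespace-separated word count is in [min_words, max_words].

    Hypothetical helper: whitespace splitting and inclusive bounds are assumptions,
    not necessarily how the released files were produced.
    """
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        num_words = len(sentence.split())
        if min_words <= num_words <= max_words:
            kept.append(sentence)
    return kept


# Toy input standing in for lines of the downloaded Wikipedia sentence file.
sample = [
    "Too short.",
    "This sentence has exactly six words.",
    " ".join(["word"] * 25),
    " ".join(["word"] * 80),
]

short_sequences = filter_by_word_count(sample, 5, 30)   # range stated for wiki_short.txt
long_sequences = filter_by_word_count(sample, 20, 70)   # range stated for wiki_long.txt
```

Because the two ranges overlap (20 to 30 words), a sentence such as the 25-word one in the toy input passes both filters under this sketch.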
