## Dataset Description

- **Homepage:** [https://skylion007.github.io/OpenWebTextCorpus/](https://skylion007.github.io/OpenWebTextCorpus/)
- **Size of downloaded dataset files:** 13.51 GB
- **Size of the generated dataset:** 41.70 GB
- **Total amount of disk used:** 55.21 GB

### Dataset Summary

An open-source replication of OpenAI's WebText dataset, which was used to train GPT-2.

This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.

#### plain_text

- **Size of downloaded dataset files:** 13.51 GB
- **Size of the generated dataset:** 41.70 GB
- **Total amount of disk used:** 55.21 GB

An example from the `train` split looks as follows.
```
This example was too long and was cropped:

{
    "text": "\"A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mei..."
}
```

### Data Fields

The data fields are the same across all splits.

#### plain_text
- `text`: a `string` feature.

### Data Splits

| name       |   train |
|------------|--------:|
| plain_text | 8013769 |
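
Since the single `train` split holds about 8 million documents and 41.70 GB of generated data, it can be convenient to stream records rather than download everything first. The following is a minimal sketch, not an official loader: the Hub ID `Skylion007/openwebtext` is an assumption (this card does not state it), and `crop`, `previews`, and `stream_openwebtext` are hypothetical helper names. Depending on the installed `datasets` version, loading may need extra arguments.

```python
from typing import Iterable, Iterator


def crop(text: str, max_chars: int = 80) -> str:
    """Shorten a long document for display, marking the cut with '...'."""
    return text if len(text) <= max_chars else text[:max_chars] + "..."


def previews(records: Iterable[dict], limit: int = 3, max_chars: int = 80) -> Iterator[str]:
    """Yield the cropped `text` field of the first `limit` records."""
    for i, record in enumerate(records):
        if i >= limit:
            break
        yield crop(record["text"], max_chars)


def stream_openwebtext():
    """Open the train split lazily instead of downloading it in full.

    Assumption: the dataset is published on the Hub as "Skylion007/openwebtext".
    Requires the `datasets` package; it is imported here so the helpers above
    stay usable without it.
    """
    from datasets import load_dataset

    return load_dataset("Skylion007/openwebtext", split="train", streaming=True)


# Example usage (needs network access):
# for line in previews(stream_openwebtext()):
#     print(line)
```

Each streamed record is a dictionary with the single `text` key described under Data Fields, so `previews` works the same way on a hand-built list of dictionaries when testing offline.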

### Citation Information

```
@misc{Gokaslan2019OpenWeb,
	title={OpenWebText Corpus},
	author={Aaron Gokaslan* and Vanya Cohen* and Ellie Pavlick and Stefanie Tellex},
	howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
	year={2019}
}
```