# Data Preparation

Store the training data in a folder containing two files: `data.jsonl` and `index`. Each line of `data.jsonl` is a JSON object in the following format:

```jsonl
{"text": "...", ...}
{"text": "...", ...}
...
```

The `index` file stores the byte offset of the start of each line in `data.jsonl`, one offset per line, for instance:

```txt
0
13297
24940
39726
...
```
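As a minimal sketch of how such an `index` file could be generated, the snippet below scans `data.jsonl` in binary mode and records where each line begins. The function names `build_index` and `read_line` are hypothetical and not part of this repository, and the sketch assumes the index is plain text with one decimal offset per line, as in the sample above.

```python
def build_index(data_path, index_path):
    """Write the byte offset of the start of each line of data_path to index_path.

    Hypothetical helper; assumes a plain-text index with one offset per line.
    """
    offsets = []
    position = 0
    # Binary mode so that offsets count bytes, not decoded characters.
    with open(data_path, "rb") as data_file:
        for line in data_file:
            offsets.append(position)
            position += len(line)
    with open(index_path, "w") as index_file:
        for offset in offsets:
            index_file.write(f"{offset}\n")
    return offsets


def read_line(data_path, offset):
    """Return the line of data_path that starts at the given byte offset."""
    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        return data_file.readline().decode("utf-8")
```

Calling `build_index("data.jsonl", "index")` inside the data folder would produce the index, and `read_line("data.jsonl", offsets[i])` would then fetch line `i` without reading the lines before it. Opening the file in binary mode matters because multi-byte UTF-8 characters would otherwise make character counts differ from byte offsets.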

You also need to create a "data config" and a "transform script" for the data. An example config is found in `../configs/data/slimpajama.json`, and an example script in `../configs/data/datasets/redpajama/script.py`. The JSON config file points to the Python script, which is used to pre-process each line in `data.jsonl` into the same format.
