# Dataset Preprocessing
We use the Penn Treebank (PTB) corpus as an example to illustrate how to preprocess a dataset.

## Download datasets

```bash 
mkdir -p PTB && cd PTB
wget -c https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt -O train.txt
```

## Numberize the dataset with a tokenizer
This step converts the raw text into token ids. Assuming the current directory is the ``PTB`` folder, we can also tokenize the corpus up front:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

# Write one line of space-separated token ids for each line of input text.
with open('train.txt', 'r', encoding='utf-8') as input_file, \
     open('train.out', 'w', encoding='utf-8') as output_file:
    for line in input_file:
        ids = tokenizer.encode(line)
        output_file.write(" ".join(map(str, ids)) + "\n")
```
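To sanity-check the numberized output, the ids can be parsed back into integers and counted. The following is a minimal sketch and not part of `preprocess.py`; the helper names `parse_numberized_line` and `count_lines_and_tokens` are hypothetical, and the only assumption is the format written above (one line of space-separated ids per input line).

```python
from typing import Iterable, List, Tuple


def parse_numberized_line(line: str) -> List[int]:
    """Convert one line of space-separated token ids back into integers."""
    return [int(tok) for tok in line.split()]


def count_lines_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lines, total number of token ids) for numberized text."""
    n_lines = 0
    n_tokens = 0
    for line in lines:
        n_lines += 1
        n_tokens += len(parse_numberized_line(line))
    return n_lines, n_tokens


# Example usage (assumes train.out was produced by the step above):
# with open("train.out", "r", encoding="utf-8") as f:
#     print(count_lines_and_tokens(f))
```

The line count should match the number of lines in `train.txt`, since the tokenization step writes exactly one output line per input line.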

## Binarize the dataset
```bash
python preprocess.py --data_dir PTB --dest_dir PTB-bin --tokenizer meta-llama/Meta-Llama-3-8B
```
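Conceptually, binarizing means storing token ids as fixed-width integers rather than as text. The on-disk format that `preprocess.py` actually writes is not described here, so the following is only an illustrative sketch using Python's standard `array` module; the unsigned-int layout and the helper names `ids_to_bytes` and `bytes_to_ids` are assumptions made for the example.

```python
from array import array
from typing import List


def ids_to_bytes(ids: List[int]) -> bytes:
    """Pack token ids into a flat buffer of unsigned ints (assumed layout)."""
    # Type code "I" is a C unsigned int; its size is platform dependent
    # (commonly 4 bytes) and is available as array("I").itemsize.
    return array("I", ids).tobytes()


def bytes_to_ids(data: bytes) -> List[int]:
    """Unpack a buffer produced by ids_to_bytes back into a list of ids."""
    buf = array("I")
    buf.frombytes(data)
    return buf.tolist()


# Round trip: the ids survive packing and unpacking unchanged.
example_ids = [1, 15043, 3186]
packed = ids_to_bytes(example_ids)
restored = bytes_to_ids(packed)
```

A fixed-width binary layout like this makes the position of every token computable from its index, which is what allows large corpora to be read without parsing text.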


## Data folder containing many files

```bash
python preprocess.py --data_dir DATADIR --prefix_name train --dest_dir dataset 
```

## Data folder containing a large file

For a dataset that has already been numberized by the tokenizer, use the following commands to process it.

```bash
python preprocess.py --data_dir DATADIR --prefix_name train --dest_dir dataset --chunk_load --numberized
python preprocess.py --data_dir DATADIR --prefix_name valid --dest_dir dataset --chunk_load --numberized
```
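The idea behind loading a large numberized file in chunks can be sketched as below. This is not the implementation behind `--chunk_load`, whose internals are not documented here; `iter_chunks` and its `chunk_size` parameter are hypothetical names, and the sketch only assumes the numberized format of one line of space-separated ids per line.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[List[int]]]:
    """Yield lists of parsed id sequences, at most chunk_size lines at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(lines)
    while True:
        # Pull the next chunk_size lines without reading the whole input.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield [[int(tok) for tok in line.split()] for line in chunk]


# Example usage (assumes a numberized file such as train.out exists):
# with open("train.out", "r", encoding="utf-8") as f:
#     for chunk in iter_chunks(f, chunk_size=10000):
#         ...  # process one chunk at a time
```

Because a file object yields lines lazily, only one chunk is held in memory at a time, which keeps peak memory bounded regardless of the file size.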