# Sparse Training

Before setting up the environment, make sure your current working directory is the root of this repository, where `setup.py` is located.

In `LLMProxy/option.py`, set your Hugging Face key on lines 42 and 95.

## Environment Setup with CUDA 12.X
```
conda create -n st python=3.11.9
conda activate st
pip install --editable .
```
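The install command above only works from the repository root. A minimal sketch of a guard that checks for `setup.py` before installing; `check_repo_root` is a hypothetical helper for illustration and is not part of this repository:

```shell
# Hypothetical helper (not shipped with this repository): succeeds only if
# the given directory (default: current directory) contains setup.py.
check_repo_root() {
    dir="${1:-.}"
    if [ -f "$dir/setup.py" ]; then
        echo "ok: found $dir/setup.py"
    else
        echo "error: $dir/setup.py not found; cd to the repository root first" >&2
        return 1
    fi
}

# Run the check first; only report readiness if it passes.
check_repo_root && echo "ready to run: pip install --editable ."
```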

## Download and Process Data
We provide a script that preprocesses and binarizes the data files to speed up training and inference.
```
python preprocess.py \
        --data_dir $PLAIN_TEXT_DIR \
        --dest_dir $OUTPUT_DIR \
        --tokenizer $YOUR_TOKENIZER
```
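The command above expects `$PLAIN_TEXT_DIR`, `$OUTPUT_DIR`, and `$YOUR_TOKENIZER` to be defined in the shell. A minimal sketch of setting them and failing early if one is missing; the paths are placeholder values and `require_vars` is a hypothetical helper, neither of which comes from this repository:

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
    for name in "$@"; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: \$$name is not set" >&2
            return 1
        fi
    done
}

# Placeholder values (assumptions); replace them with your own paths and tokenizer.
PLAIN_TEXT_DIR=./data/plain
OUTPUT_DIR=./data/bin
YOUR_TOKENIZER=meta-llama/Llama-2-7b-hf

require_vars PLAIN_TEXT_DIR OUTPUT_DIR YOUR_TOKENIZER \
    && echo "all preprocessing variables are set"
```

The same check can be reused before the training and validation commands by passing the variable names they reference.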

## Training

An example command for training:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --nnodes=1 scripts/train.py \
    --data_dir $YOUR_DATADIR \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --tokenizer meta-llama/Llama-2-7b-hf \
    --train_split $YOUR_TRAIN_DATA \
    --valid_split $YOUR_VALIDATION_DATA \
    --use_hf_checkpoint \
    --append_bos True \
    --seq_len 8192 \
    --sample_len 4096 \
    --precision bf16 \
    --batch_size 2 \
    --learning_rate 1e-5 \
    --model_init config2 \
    --use_deepspeed true --deepspeed_config deepspeed_configs/zero_stage2_config.json \
    --accumulation_step 4 \
    --save_every_N_steps -1 \
    --max_update 2000 \
    --save_dir $YOUR_SAVE_DIR \
    --auth $YOUR_HUGGINGFACE_TOKEN
```
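To reason about the scale of this example, the sketch below works out how many sequences one optimizer update consumes. It assumes that `--batch_size` is counted per process and that `--max_update` counts optimizer updates; neither is confirmed by this README, so check the training script before relying on the numbers.

```shell
# Values copied from the example training command above.
NPROC_PER_NODE=8      # --nproc_per_node
NNODES=1              # --nnodes
BATCH_SIZE=2          # --batch_size (ASSUMED to be per process)
ACCUMULATION_STEP=4   # --accumulation_step
MAX_UPDATE=2000       # --max_update (ASSUMED to count optimizer updates)

# Sequences consumed per optimizer update, under the assumptions above.
SEQS_PER_UPDATE=$((NPROC_PER_NODE * NNODES * BATCH_SIZE * ACCUMULATION_STEP))

# Sequences consumed over the whole run.
TOTAL_SEQS=$((SEQS_PER_UPDATE * MAX_UPDATE))

echo "sequences per update: $SEQS_PER_UPDATE"
echo "sequences over $MAX_UPDATE updates: $TOTAL_SEQS"
```

Under these assumptions the example uses 64 sequences per update and 128000 sequences over the run. Changing the number of GPUs while keeping the same effective batch size means adjusting `--batch_size` or `--accumulation_step` so that the product stays constant.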

## Validation
An example command for validation:

```
CUDA_VISIBLE_DEVICES=0 python scripts/validate.py \
    --data_dir $DATADIR \
    --model_name_or_path $YOUR_CHECKPOINT \
    --tokenizer $YOUR_TOKENIZER \
    --train_split $YOUR_TEST_DATA \
    --valid_split $YOUR_TEST_DATA \
    --use_hf_checkpoint \
    --append_bos True \
    --seq_len $YOUR_SEQUENCE_LEN \
    --sample_len 4096 \
    --precision bf16 \
    --eval_batch_size 2 \
    --learning_rate 1e-5 \
    --model_init config2 \
    --accumulation_step 4 \
    --save_every_N_steps -1 \
    --save_dir $YOUR_SAVE_DIR \
    --auth $YOUR_HUGGINGFACE_TOKEN
```




