# Example scripts for running experiments
This directory contains example scripts for running the scaling-law experiments from our paper.

## Data preparation

Download the dataset:
```
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC
```
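If `git lfs` was not set up correctly before cloning, the `.jsonl` files can end up as small LFS pointer stubs instead of real data. The following is a minimal sketch of a check for that, assuming bash; the helper names `is_lfs_pointer` and `count_lfs_pointers` are hypothetical and not part of this repository.

```bash
# Hypothetical helpers (not part of this repo): detect Git LFS pointer stubs.

# Returns 0 if the file looks like a Git LFS pointer, non-zero otherwise.
# A pointer is a small text file whose first line names the LFS spec.
is_lfs_pointer() {
    head -c 64 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Prints the number of *.jsonl files under the given root that are still pointers.
# Offending paths are reported on stderr.
count_lfs_pointers() {
    local root="$1" count=0 f
    while IFS= read -r -d '' f; do
        if is_lfs_pointer "$f"; then
            echo "still a pointer: $f" >&2
            count=$((count + 1))
        fi
    done < <(find "$root" -type f -name "*.jsonl" -print0)
    echo "$count"
}

# Usage: count_lfs_pointers /path/to/dataset
```

If the count is non-zero, running `git lfs pull` inside the cloned repository should fetch the actual file contents.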

Tokenize and binarize the dataset:
```
INPUT_DIR=/path/to/dataset
OUTPUT_DIR=/path/to/dataset/tokenized_slimpajama_cc/train1
TOKENIZER_MODEL="/meta-llama/Llama-2-7b-hf/tokenizer.model"  # Tokenizer model path

mkdir -p "$OUTPUT_DIR"

SCRIPT_START=$(date +%s)

cd /path/to/megatron-lm 

find "$INPUT_DIR" -type f -name "*.jsonl" | while IFS= read -r DATASET_PATH; do

    # Extract the base filename and create the output prefix
    FILENAME=$(basename "$DATASET_PATH")
    OUTPUT_PREFIX="${OUTPUT_DIR}/${FILENAME%.*}"

    # Run the preprocessing script
    python tools/preprocess_data.py \
        --input "$DATASET_PATH" \
        --output-prefix "${OUTPUT_PREFIX}" \
        --tokenizer-type Llama2Tokenizer \
        --tokenizer-model "$TOKENIZER_MODEL" \
        --workers 128 \
        --append-eod

done


SCRIPT_END=$(date +%s)
SCRIPT_DURATION=$((SCRIPT_END - SCRIPT_START))
echo "All done. Total script duration: $SCRIPT_DURATION seconds."
```
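After a long preprocessing run it can help to confirm that every input file produced its tokenized outputs. The sketch below assumes bash and assumes the outputs are named `<prefix>_text_document.bin` and `<prefix>_text_document.idx`, which is an assumption about the default naming of `tools/preprocess_data.py`. Pass a different suffix as the third argument if your files are named differently. The helper name `find_missing_outputs` is hypothetical.

```bash
# Hypothetical helper (not part of this repo): list *.jsonl inputs whose
# tokenized outputs are missing or empty.
# Assumption: outputs are "<prefix><suffix>.bin" and "<prefix><suffix>.idx",
# with suffix defaulting to "_text_document".
find_missing_outputs() {
    local input_dir="$1" output_dir="$2" suffix="${3:-_text_document}"
    local path name prefix
    while IFS= read -r path; do
        # Same prefix derivation as the tokenization loop above.
        name=$(basename "$path")
        prefix="${output_dir}/${name%.*}"
        if [ ! -s "${prefix}${suffix}.bin" ] || [ ! -s "${prefix}${suffix}.idx" ]; then
            echo "$path"
        fi
    done < <(find "$input_dir" -type f -name "*.jsonl")
}

# Usage: find_missing_outputs "$INPUT_DIR" "$OUTPUT_DIR"
```

An empty result means every input has non-empty `.bin` and `.idx` files. Any path that is printed can be passed to the preprocessing script again.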

Run the experiments:
  - Copy the slurm files (including `runner.slurm`) into your Megatron-LM directory and `cd` into it
  - Run
    * training: `base_1b.slurm`