## Installation

**Native Runner:** Setup a conda environment using [`conda`](https://github.com/conda/conda) / [`mamba`](https://github.com/mamba-org/mamba):

```bash
conda env create --file conda-recipe.yaml  # or `mamba env create --file conda-recipe.yaml`
```

This will automatically setup all dependencies.

**Containerized Runner:** Other than using the native machine with conda isolation, as an alternative, you can also use docker images to configure the environment.

Firstly, please follow [NVIDIA Container Toolkit: Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) and [NVIDIA Docker: Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) to setup `nvidia-docker`.
Then you can run:

```bash
make docker-run
```

This command will build and start a docker container installed with proper dependencies.
The host path `/` will be mapped to `/host` and the current working directory will be mapped to `/workspace` inside the container.

## Training

`safe-rlhf` supports a complete pipeline from Supervised Fine-Tuning (SFT) to preference model training to RLHF alignment training.

0. Follow the instructions in section [Installation](#installation) to setup the training environment properly.

```bash
conda activate safe-rlhf
export WANDB_API_KEY="..."  # your W&B API key here
```

or

```bash
make docker-run
export WANDB_API_KEY="..."  # your W&B API key here
```

1. Supervised Fine-Tuning (SFT)

```bash
bash scripts/sft.sh \
    --model_name_or_path <your-model-name-or-checkpoint-path> \
    --output_dir output/sft
```

NOTE: You may need to update some of the parameters in the script according to your machine setup, such as the number of GPUs for training, the training batch size, etc.

2. Value Models (reward model & cost model)

```bash
bash scripts/reward-model.sh \
    --model_name_or_path output/sft \
    --output_dir output/rm
```

```bash
bash scripts/cost-model.sh \
    --model_name_or_path output/sft \
    --output_dir output/cm
```

3. RLHF (Optional)

```bash
bash scripts/ppo.sh \
    --actor_model_name_or_path output/sft \
    --reward_model_name_or_path output/rm \
    --output_dir output/ppo
```

4. Safe-RLHF

```bash
bash scripts/ppo-lag.sh \
    --actor_model_name_or_path output/sft \
    --reward_model_name_or_path output/rm \
    --cost_model_name_or_path output/cm \
    --output_dir output/ppo-lag
```

An example of commands to run the whole pipeline with [LLaMA-7B](https://ai.facebook.com/blog/large-language-model-llama-meta-ai):

```bash
conda activate safe-rlhf
bash scripts/sft.sh --model_name_or_path ~/models/llama-7b --output_dir output/sft
bash scripts/reward-model.sh --model_name_or_path output/sft --output_dir output/rm
bash scripts/cost-model.sh --model_name_or_path output/sft --output_dir output/cm
bash scripts/ppo-lag.sh \
    --actor_model_name_or_path output/sft \
    --reward_model_name_or_path output/rm \
    --cost_model_name_or_path output/cm \
    --output_dir output/ppo-lag
```

```bash
bash scripts/sft.sh \
    --model_name_or_path ~/models/llama-7b \
    --output_dir output/sft \
    --offload all  # or `parameter` or `optimizer`
```

For multi-node settings, users can refer to the [DeepSpeed: Resource Configuration (multi-node)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) documentation for more details. Here is an example to start the training process on 4 nodes (each has 8 GPUs):

```text
# myhostfile
worker-1 slots=8
worker-2 slots=8
worker-3 slots=8
worker-4 slots=8
```

Then launch the training scripts with:

```bash
bash scripts/sft.sh \
    --hostfile myhostfile \
    --model_name_or_path ~/models/llama-7b \
    --output_dir output/sft
```
