# Efficient Adversarial Training with GCG

This folder contains the code for running our efficient adversarial training method with GCG. The code builds on the [alignment handbook](https://github.com/huggingface/alignment-handbook) repository, in which we added or modified the following files:
- `./alignment-handbook/scripts/run_adv_training.sh`
- `./alignment-handbook/scripts/run_sft_adv_training.py`
- `./alignment-handbook/scripts/adv_training_utils.py`
- `./alignment-handbook/recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml`
- `./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml`

To start an adversarial training run, follow these setup steps:
- Set `num_processes` in `./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml` to the number of GPUs you want to train on, then set `NUM_ACCELERATE_GPUS` in `./alignment-handbook/scripts/run_sft_adv_training.py` to the same value.
- Make sure `NUM_TEST_CASES_TO_UPDATE_PER_STEP` in `./alignment-handbook/scripts/run_sft_adv_training.py` is a multiple of `num_processes`.
- Make sure `NUM_SINGLE_BEHAVIOR_TEST_CASES` in `./alignment-handbook/scripts/run_sft_adv_training.py` is greater than or equal to `NUM_TEST_CASES_TO_UPDATE_PER_STEP`.
- Create a new conda environment (or activate an existing one) and install the requirements in `./alignment-handbook/requirements.txt`. The version of `alignment-handbook` that we use is older than the current upstream version, so the code may not work unless these requirements are installed.
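
The consistency constraints in the setup steps above can be checked before launching a run. The following is a minimal sketch, assuming you copy the four values by hand from `deepspeed_config.yaml` and `run_sft_adv_training.py`. The function `check_adv_training_settings` is hypothetical and is not part of this repository, and the values in the usage example are illustrative, not the repository defaults.

```python
from typing import List


def check_adv_training_settings(
    num_processes: int,
    num_accelerate_gpus: int,
    num_test_cases_to_update_per_step: int,
    num_single_behavior_test_cases: int,
) -> List[str]:
    """Return a list of violated constraints (empty if the settings are consistent).

    Hypothetical helper: it only mirrors the three rules stated in the setup steps.
    """
    problems = []

    if num_processes < 1:
        # Without a valid GPU count the remaining checks are meaningless.
        return ["num_processes must be at least 1"]

    # NUM_ACCELERATE_GPUS must match num_processes in the accelerate config.
    if num_accelerate_gpus != num_processes:
        problems.append(
            f"NUM_ACCELERATE_GPUS ({num_accelerate_gpus}) != num_processes ({num_processes})"
        )

    # NUM_TEST_CASES_TO_UPDATE_PER_STEP must be a multiple of num_processes.
    if num_test_cases_to_update_per_step % num_processes != 0:
        problems.append(
            f"NUM_TEST_CASES_TO_UPDATE_PER_STEP ({num_test_cases_to_update_per_step}) "
            f"is not a multiple of num_processes ({num_processes})"
        )

    # NUM_SINGLE_BEHAVIOR_TEST_CASES must be >= NUM_TEST_CASES_TO_UPDATE_PER_STEP.
    if num_single_behavior_test_cases < num_test_cases_to_update_per_step:
        problems.append(
            f"NUM_SINGLE_BEHAVIOR_TEST_CASES ({num_single_behavior_test_cases}) < "
            f"NUM_TEST_CASES_TO_UPDATE_PER_STEP ({num_test_cases_to_update_per_step})"
        )

    return problems


if __name__ == "__main__":
    # Illustrative values only; substitute the ones from your own files.
    for problem in check_adv_training_settings(8, 8, 12, 32):
        print("Setting error:", problem)
```

An empty list means the three rules above are satisfied; it does not guarantee that the run fits in GPU memory or that the environment is set up correctly.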

Then run the following commands:
```bash
cd alignment-handbook
# Step 1 - train SFT policy
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_config.yaml scripts/run_sft_adv_training.py recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml
```

Running this code with the default arguments produces the Zephyr 7B + R2D2 model that we evaluate in the paper. The model in the paper is a snapshot taken after 2000 steps and is available here: [🤗 cais/zephyr_7b_r2d2](https://huggingface.co/cais/zephyr_7b_r2d2).