# AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM

## Setup

```bash
conda create --name AmpleGCG python=3.11.4

conda activate AmpleGCG

pip install -r requirements.txt
```

## Experiments

### Augmented GCG

Augmented GCG extends GCG by overgenerating suffix candidates during optimization.
To obtain suffixes with augmented GCG under either the individual-query or the multiple-queries setting, first change into the launch scripts directory:

```bash
cd llmattack/experiments/launch_scripts
```

We provide scripts for four settings of augmented GCG.
<a name="individual-query"></a>
1. Individual Query

    1.1 Individual Model

    ```bash
    bash run_overgenerate_indiv_query_indiv_model_llama2-chat.sh
    ```

    1.2 Multiple Models

    ```bash
    bash run_overgenerate_indiv_query_multi_models_llama2-chat_vicuna.sh
    ```

2. Multiple Queries

    2.1 Individual Model

    ```bash
    bash run_overgenerate_mutli_queries_indiv_model.sh
    ```

    2.2 Multiple Models

    ```bash
    bash run_overgenerate_mutli_queries_multi_models_vicuna7_13b_guanaco_7_13b.sh
    ```

> Note that in the multiple-queries settings, we save only the suffixes with the lowest loss at each step, whereas the individual-query settings save all available sampled candidates at each step.

For the individual-query and multiple-queries settings, we save the candidate suffixes under the keys `step_cands` and `controls`, respectively. The suffixes in `controls` are the ones optimized over all training queries. In the individual-query setting, the suffixes are saved in the following format:
```
query:
    ...,

    step_N-1:[
        control: <suffix>,
        loss: <loss>
    ],

    step_N:[
        control: <suffix>,
        loss: <loss>
    ],

    ...
```
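
As a minimal Python sketch of how such a record could be consumed, the snippet below flattens the nested structure into one row per candidate. It assumes the value stored under `step_cands` maps each query to a dictionary of steps, and that each step holds a list of `control`/`loss` entries; this nesting is inferred from the layout above and may differ from the actual files. `flatten_step_cands` and the toy data are hypothetical.

```python
def flatten_step_cands(step_cands):
    """Flatten {query: {step: [{"control", "loss"}, ...]}} into a list of rows.

    The nesting is an assumption based on the format shown in this README.
    """
    records = []
    for query, steps in step_cands.items():
        for step, candidates in steps.items():
            for cand in candidates:
                records.append(
                    {
                        "query": query,
                        "step": step,
                        "control": cand["control"],
                        "loss": cand["loss"],
                    }
                )
    return records


# Hypothetical toy data; real suffixes and losses come from the launch scripts.
example = {
    "example query": {
        "step_0": [
            {"control": "! ! a", "loss": 2.5},
            {"control": "! ! b", "loss": 2.1},
        ],
        "step_1": [{"control": "! c b", "loss": 1.7}],
    }
}

records = flatten_step_cands(example)
# Candidate with the lowest loss across all steps.
best = min(records, key=lambda r: r["loss"])
print(len(records), best["control"])
```

A flat list like this is convenient for sorting by loss or for deduplicating suffixes before evaluation.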



### Evaluation
We provide a modular and flexible pipeline for evaluating different victim models.

Take the multiple-queries setting as an example.

Once you have the results from augmented GCG above, first deduplicate the generated suffixes and place them in `myconfig/prompt_own_list.json` under a key of your choice, e.g. **llama2_lowest** for default GCG (only the suffixes with the lowest loss) or **llama2_lowest_at_each_step** for Overgenerate + X under the multiple-queries setting in the paper's tables. Then replace the variable **augmented_GCG** in `evaluate_augmentedGCG.sh` with the keys you defined and run:
```bash
cd <project_workspace>
bash evaluate_augmentedGCG.sh
```
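
The deduplication step could be sketched in Python as follows. This assumes `prompt_own_list.json` is a JSON object that maps each key to a list of suffix strings, which is not confirmed here, so check the file shipped with the repository first. `register_suffixes` is a hypothetical helper, and the demo writes to a temporary directory rather than to `myconfig/`.

```python
import json
import tempfile
from pathlib import Path


def register_suffixes(config_path, key, suffixes):
    """Deduplicate `suffixes` (keeping first-seen order) and store them under `key`.

    Assumes the config file is a JSON object of {key: [suffix, ...]}.
    """
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    # dict.fromkeys drops repeats while preserving insertion order.
    config[key] = list(dict.fromkeys(suffixes))
    path.write_text(json.dumps(config, indent=2))
    return config[key]


# Demo on a throwaway file; the suffix strings are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "prompt_own_list.json"
    kept = register_suffixes(
        demo_path, "llama2_lowest_at_each_step", ["a b", "c d", "a b"]
    )
    stored = json.loads(demo_path.read_text())

print(kept)
```

Existing keys in the file are left untouched, so several suffix sets can be kept side by side.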

You can swap in other victim models and change their generation configs under `myconfig/target_lm` using [hydra](https://hydra.cc/docs/intro/).


After obtaining the outputs from the victim models, you can check them for harmfulness by running:

```bash
bash add_reward.sh sequence
```
This first labels the instances with [Beaver-Cost](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost) and then uses the [HarmBench Classifier](https://huggingface.co/cais/HarmBench-Llama-2-13b-cls) to evaluate only the instances that Beaver-Cost deems harmful.
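
The control flow of this sequential mode can be illustrated with a short Python sketch. The two judges below are keyword-based stand-ins, not the actual Beaver-Cost or HarmBench models, and the sketch assumes an instance counts as harmful only when both judges flag it. `label_sequentially` and the field names are hypothetical and do not mirror the internals of `add_reward.sh`.

```python
def label_sequentially(responses, first_judge, second_judge):
    """Run `second_judge` only on responses that `first_judge` flags."""
    labeled = []
    for response in responses:
        first = first_judge(response)
        # Skip the second (typically more expensive) judge for unflagged items.
        second = second_judge(response) if first else None
        labeled.append(
            {
                "response": response,
                "first": first,
                "second": second,
                "harmful": bool(first and second),
            }
        )
    return labeled


second_judge_calls = []


def stub_first_judge(text):
    # Placeholder for the first-stage classifier.
    return "unsafe" in text


def stub_second_judge(text):
    # Placeholder for the second-stage classifier; records each call.
    second_judge_calls.append(text)
    return "confirmed" in text


results = label_sequentially(
    ["safe reply", "unsafe reply", "unsafe confirmed reply"],
    stub_first_judge,
    stub_second_judge,
)
print([r["harmful"] for r in results], len(second_judge_calls))
```

Running the second judge only on flagged instances keeps the cost of the second stage proportional to the number of candidates that pass the first.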

You can also use a more advanced GPT-4 evaluator:
```bash
bash add_reward.sh gpt4
```


### AmpleGCG
Due to ethical concerns, we do not directly release the models themselves. However, researchers can train an AmpleGCG-like adversarial suffix generator on the data collected in the [individual query settings](#individual-query). For details on the *overgenerate-then-filter* pipeline for collecting training data from either an individual model or multiple models, please refer to the paper and the figure below.

![figure below](pipeline.png "overgenerate-then-filter")
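
The filtering half of that pipeline could be sketched in Python as below. This is only an illustration under stated assumptions: `collect_training_pairs` is a hypothetical helper, `is_successful` stands in for whatever evaluator decides that a suffix worked for a query, and the output layout is not the training format used in the paper.

```python
def collect_training_pairs(candidates, is_successful):
    """Keep each unique (query, suffix) pair that the evaluator accepts."""
    seen = set()
    pairs = []
    for query, suffix in candidates:
        if (query, suffix) in seen:
            continue  # skip duplicates from overgeneration
        seen.add((query, suffix))
        if is_successful(query, suffix):
            pairs.append({"query": query, "suffix": suffix})
    return pairs


def stub_evaluator(query, suffix):
    # Placeholder: a real check would query the victim model and judge its output.
    return suffix.endswith("ok")


# Toy overgenerated candidates; strings are placeholders.
overgenerated = [
    ("query A", "s1 ok"),
    ("query A", "s1 ok"),
    ("query A", "s2"),
    ("query B", "s3 ok"),
]

training_pairs = collect_training_pairs(overgenerated, stub_evaluator)
print(training_pairs)
```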


Once you have trained your own generator, you can evaluate it with `evaluate_augmentedGCG.sh` as well. You can also explore different generation configs for your generator in `myconfig/generation_configs`, since, as we illustrate, different decoding approaches affect the diversity and quality of the suffixes.
