# Jailbreak-Connectivity

## Preparation

### Environment

Set up the environment with the following commands:

```shell
conda env create -f environment.yml
conda activate JC
```

### Model

We build upon the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main) codebase.

Follow the official instructions in the repository to download all required weights.

Once downloaded:

- Specify the **Vicuna weights path** in the model configuration file `minigpt4/configs/models/minigpt4.yaml` (line 16).
- Specify the **pretrained checkpoint path** in the evaluation configuration file `eval_configs/minigpt4_eval.yaml` (line 11).
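
Before launching an attack, it can help to confirm that both configured paths resolve. The following is a minimal standard-library sketch, not part of this repository: `find_yaml_value` and `check_path` are hypothetical helpers, and the key names `llama_model` and `ckpt` are assumptions based on the upstream MiniGPT-4 configs, so verify them against your copy of the files.

```python
import os
import re
from typing import Optional


def find_yaml_value(text: str, key: str) -> Optional[str]:
    """Return the scalar value of the first `key: value` line, or None.

    This is a simple line scan, not a full YAML parser.
    """
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*:\s*(.*?)\s*(?:#.*)?$")
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            # Drop surrounding quotes, if any.
            return match.group(1).strip("'\"")
    return None


def check_path(config_file: str, key: str) -> bool:
    """Report whether the path stored under `key` exists on disk."""
    if not os.path.isfile(config_file):
        print(f"{config_file}: config file not found")
        return False
    with open(config_file, encoding="utf-8") as fh:
        value = find_yaml_value(fh.read(), key)
    ok = bool(value) and os.path.exists(value)
    print(f"{config_file}: {key} = {value!r} -> {'found' if ok else 'MISSING'}")
    return ok


if __name__ == "__main__":
    # Key names are assumptions; adjust them to match your config files.
    check_path("minigpt4/configs/models/minigpt4.yaml", "llama_model")
    check_path("eval_configs/minigpt4_eval.yaml", "ckpt")
```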

### Dataset

We evaluate our method (JC) on two datasets:

- [SafetyBench](https://huggingface.co/datasets/thu-coai/SafetyBench)
- [AdvBench](https://huggingface.co/datasets/walledai/AdvBench)

You can download these datasets by following the instructions provided on Hugging Face.
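
As one possible way to fetch the prompts programmatically, the sketch below wraps `datasets.load_dataset` from the Hugging Face `datasets` library. `load_prompts` is a hypothetical helper, and the split and column names in the usage comment are assumptions; check each dataset card for the actual configuration, split, and column names, and for any access terms.

```python
from typing import Callable, List, Optional


def load_prompts(
    repo_id: str,
    split: str,
    column: str,
    name: Optional[str] = None,
    loader: Optional[Callable] = None,
) -> List[str]:
    """Load one text column from a Hugging Face dataset split.

    `loader` defaults to `datasets.load_dataset`; passing another callable
    with the same call shape allows offline use.
    """
    if loader is None:
        # Imported lazily so the module loads without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset
    dataset = loader(repo_id, name, split=split)
    return [row[column] for row in dataset]


# Example usage (split and column names are assumptions; see the dataset card):
# prompts = load_prompts("walledai/AdvBench", split="train", column="prompt")
```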

### Evaluation Metrics

#### Attack Success Rate (ASR)

We use [Beaver-dam-7B](https://github.com/PKU-Alignment/beavertails) as the judge model to compute ASR.

Follow the [model card on Hugging Face](https://huggingface.co/PKU-Alignment/beaver-dam-7b) to deploy Beaver-dam-7B.
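
ASR is commonly reported as the fraction of responses that the judge flags as harmful. The sketch below assumes you already have one boolean verdict per response from Beaver-dam-7B; `attack_success_rate` is a hypothetical helper, and whether `evaluation.py` aggregates in exactly this way is an assumption.

```python
from typing import Sequence


def attack_success_rate(flagged: Sequence[bool]) -> float:
    """Fraction of responses that the judge flagged as harmful."""
    if len(flagged) == 0:
        raise ValueError("need at least one judged response")
    return sum(bool(f) for f in flagged) / len(flagged)


# Example usage with hypothetical verdicts (True = judged harmful):
# attack_success_rate([True, False, True, True])
```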

#### Perplexity (PPL)

We use [GPT-2](https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf) to evaluate the perplexity (PPL) of responses.

Deployment instructions can be found in the [GPT-2 repository](https://github.com/openai/gpt-2).
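
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows one way to compute it with the Hugging Face `transformers` port of GPT-2 instead of the original repository; that substitution, the `gpt2` checkpoint name, and the helper names are assumptions, not necessarily what `evaluation.py` does.

```python
import math
from typing import Sequence


def perplexity_from_nll(token_nlls: Sequence[float]) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    if len(token_nlls) == 0:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def gpt2_perplexity(text: str, model_name: str = "gpt2") -> float:
    """Score `text` with GPT-2; downloads the checkpoint on first use."""
    # Imported lazily so the pure helper above works without these packages.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    # GPT-2 has a 1024-token context, so longer responses are truncated here.
    input_ids = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=1024
    ).input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy
        # over the predicted tokens.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())


# Example usage:
# ppl = gpt2_perplexity("The quick brown fox jumps over the lazy dog.")
```

Responses that tokenize to a single token have no predicted tokens and should be filtered out before scoring.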

#### Toxicity Score

We use the [Detoxify classifier](https://github.com/unitaryai/detoxify) to score six toxicity attributes.

Installation instructions and model checkpoints are available in the same repository.
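
A minimal sketch of scoring a batch of responses with Detoxify and averaging each attribute is shown below. The `original` checkpoint name and the averaging step are assumptions about a reasonable setup, and `score_responses` and `mean_scores` are hypothetical helpers rather than code from this repository.

```python
from typing import Dict, List


def score_responses(
    responses: List[str], model_name: str = "original"
) -> List[Dict[str, float]]:
    """Return one {attribute: score} dict per response."""
    # Imported lazily; the checkpoint is downloaded on first use.
    from detoxify import Detoxify

    # For a list input, predict() returns {attribute: [score per response]}.
    results = Detoxify(model_name).predict(responses)
    return [
        {attr: float(values[i]) for attr, values in results.items()}
        for i in range(len(responses))
    ]


def mean_scores(per_response: List[Dict[str, float]]) -> Dict[str, float]:
    """Average every attribute over all responses."""
    if not per_response:
        raise ValueError("need at least one scored response")
    attributes = per_response[0].keys()
    return {
        attr: sum(scores[attr] for scores in per_response) / len(per_response)
        for attr in attributes
    }


# Example usage:
# summary = mean_scores(score_responses(["first response", "second response"]))
```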

## Attack

### Diverse Jailbreak

```shell
python diverse_jailbreak.py \
    --cfg_path eval_configs/minigpt4_eval.yaml --gpu_id 0 \
    --n_iters 5000 --constrained --control_points 1 --t_size 8 \
    --eps 16 --alpha 1 --save_dir output
```

### Transferable Jailbreak

```shell
python transferable_jailbreak.py \
    --cfg_path eval_configs/minigpt4_eval.yaml --gpu_id 0 \
    --n_iters 5000 --constrained --control_points 1 --t_size 8 \
    --classifier_dir classifier --eps 16 --alpha 1 --save_dir output
```

### Universal Jailbreak

```shell
python universal_jailbreak.py \
    --cfg_path eval_configs/minigpt4_eval.yaml --gpu_id 0 \
    --n_iters 5000 --constrained --control_points 1 --t_size 8 \
    --classifier_dir classifier --eps 16 --alpha 1 --save_dir output
```

## Evaluation

```shell
python evaluation.py --cfg_path eval_configs/minigpt4_eval.yaml --gpu_id 0 --path_file output/jailbreak_path.pt
```

