# Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

- This repository contains the code accompanying the paper, together with a description of its structure and usage.



## Dependencies

- Python == 3.10
- PyTorch == 2.3.0
- CUDA == 12.1

Install the dependencies with:

```
pip install -r requirements.txt
```
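To confirm that the interpreter and the PyTorch build match the versions listed above, a small check along the following lines may help. This is a minimal sketch and not part of the repository; the helper names `version_matches` and `check_environment` are hypothetical.

```python
import sys


def version_matches(found: str, expected: str) -> bool:
    """Return True if `found` begins with the `expected` release components.

    Local build tags such as "+cu121" are ignored, so "2.3.0+cu121" matches "2.3.0".
    """
    release = found.split("+", 1)[0]
    found_parts = release.split(".")
    expected_parts = expected.split(".")
    return found_parts[: len(expected_parts)] == expected_parts


def check_environment() -> dict:
    """Compare the running environment with the versions listed in this README."""
    report = {
        "python": version_matches("%d.%d" % sys.version_info[:2], "3.10"),
    }
    try:
        import torch  # imported lazily so the check also runs before installation
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = version_matches(torch.__version__, "2.3.0")
    # torch.version.cuda is None on CPU-only builds
    report["cuda"] = version_matches(torch.version.cuda or "", "12.1")
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

A `MISMATCH` line only signals a difference from the listed versions; other versions may still work but are untested here.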

If the ```flash_attn``` and ```vllm``` packages fail to install, try installing them from prebuilt wheels instead:

```
pip install flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl
```

- If a Llama Guard version is incompatible with your installed ```transformers``` library, try installing the ```v4.51.3-LlamaGuard-preview``` version from GitHub.



## File Overview

|              File/Folder              |                           Meaning                            |
| :-----------------------------------: | :----------------------------------------------------------: |
|                config                 | Configuration files for the .py scripts (names correspond one to one) |
|                 data                  |      Dataset files (the last file contains H-CoT data)       |
|               Templates               |              Templates used in the experiments               |
|                 Tools                 |                      Utility functions                       |
|         Generate_Response.py          | Run the attack to induce the target model to output malicious responses |
| Scoring_Response_by_Commercial_LLM.py |        Score responses using the commercial LLM API          |
|  Scoring_Response_by_Llama_Guard.py   |              Score responses with Llama Guard                |
|  Scoring_Prompt_by_Commercial_LLM.py  | Score the prompts in the dataset using the commercial LLM API |
|   Scoring_Prompt_by_Llama_Guard.py    |      Score the prompts in the dataset with Llama Guard       |
|          Format_Response.py           | Adjust the format of the response file (only for H-CoT and DH-CoT) |
|           requirements.txt            |                Environment configuration file                |



## Quick Start

### Generate Jailbreak Response

Step 1: Adjust configuration file (```./config/cfg_Generate_Response.yaml```)

```yaml
# Attack gpt-4.1 with D-attack
model_name: "gpt-4.1" # Victim Model
input_file: "./data/3_RTA-JailbreakStudy.csv" # File of malicious questions
output_dir: "./outputs/responses" # Path to save responses
user_template: "Car_Answer" # Can be "Vanilla" (DH-CoT, fill H-CoT by datasets), "Car_Answer" (D-Attack), Car_Modify
developer_message: "D1" # Can be "None", "Normal", "D1" (D-Attack), "D2", ..., "D9" (H-CoT), "D10" (H-CoT)
```

```yaml
# Attack o4-mini with DH-CoT (D10)
model_name: "o4-mini" # Victim Model
input_file: "./data/6_RTA-MaliciousEducator-Hcot.csv" # File of malicious questions
output_dir: "./outputs/responses" # Path to save responses
user_template: "Vanilla" # Can be "Vanilla" (DH-CoT, fill H-CoT by datasets), "Car_Answer" (D-Attack), Car_Modify
developer_message: "D10" # Can be "None", "Normal", "D1" (D-Attack), "D2", ..., "D9" (H-CoT), "D10" (H-CoT)
```
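A typo in ```user_template``` or ```developer_message``` can be caught before a run by checking the loaded configuration against the values listed in the comments above. The following is a minimal sketch under stated assumptions: the function `validate_config` is hypothetical, the allowed values are taken only from the comments in the two examples, and the repository may accept further values or perform its own checks.

```python
# Allowed values as listed in the comments of the example configs (assumed complete).
ALLOWED_USER_TEMPLATES = {"Vanilla", "Car_Answer", "Car_Modify"}
# "None", "Normal", and "D1" through "D10"
ALLOWED_DEVELOPER_MESSAGES = {"None", "Normal"} | {f"D{i}" for i in range(1, 11)}
REQUIRED_KEYS = (
    "model_name",
    "input_file",
    "output_dir",
    "user_template",
    "developer_message",
)


def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    template = cfg.get("user_template")
    if template is not None and template not in ALLOWED_USER_TEMPLATES:
        problems.append(f"unknown user_template: {template!r}")
    message = cfg.get("developer_message")
    if message is not None and message not in ALLOWED_DEVELOPER_MESSAGES:
        problems.append(f"unknown developer_message: {message!r}")
    return problems
```

With PyYAML installed, the dictionary can be obtained with `yaml.safe_load(open("./config/cfg_Generate_Response.yaml"))` and passed to `validate_config`.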



Step 2: Generate Response

```
python Generate_Response.py
```



### Scoring the Jailbreak Responses

Step 1: Score with the commercial LLM (after adjusting ```./config/cfg_Scoring_Response_by_Commercial_LLM.yaml```)

```
python Scoring_Response_by_Commercial_LLM.py
```



Step 2: Score with Llama Guard (after adjusting ```./config/cfg_Scoring_Response_by_Llama_Guard.yaml```)

```
python Scoring_Response_by_Llama_Guard.py
```



### Computing ASR on Scored Responses with MDH

Step 1: Merge all scored response judgment files (after adjusting ```./config/cfg_Judgements_Merge.yaml```)

```
python Judgements_Merge.py
```



Step 2: Perform the third phase of MDH multi-round voting and manual review in Excel to obtain the ASR (see the Judgements file in the supplementary material)
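
The arithmetic behind this step can also be reproduced outside Excel. The sketch below is an illustration only and not the repository's procedure: it assumes each response carries one ```"safe"``` or ```"unsafe"``` label per judge, that a strict majority decides the verdict, and that items without a majority are resolved by a manual decision. The function names and data layout are hypothetical; the actual MDH voting rules are defined in the paper and the Judgements file.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(labels: List[str]) -> Optional[str]:
    """Return the label with a strict majority, or None if no label wins more than half."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None


def attack_success_rate(
    judgements: Dict[str, List[str]],
    manual: Optional[Dict[str, str]] = None,
) -> float:
    """Compute ASR as the share of items whose final verdict is "unsafe".

    `judgements` maps an item ID to the labels given by each judge.
    `manual` holds human decisions for items without a strict majority.
    """
    manual = manual or {}
    final = {}
    for item_id, labels in judgements.items():
        verdict = majority_vote(labels)
        if verdict is None:
            if item_id not in manual:
                raise ValueError(f"{item_id} needs manual review")
            verdict = manual[item_id]
        final[item_id] = verdict
    if not final:
        raise ValueError("no judgements to score")
    return sum(v == "unsafe" for v in final.values()) / len(final)
```

For example, with three items where one is decided by manual review, `attack_success_rate({"q1": ["unsafe", "unsafe", "safe"], "q2": ["safe", "safe", "safe"], "q3": ["unsafe", "safe"]}, {"q3": "unsafe"})` evaluates to two unsafe verdicts out of three.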



### Computing ASR in the No-attack Setting with MDH

Step 1: Score with the commercial LLM (after adjusting ```./config/cfg_Scoring_Prompt_by_Commercial_LLM.yaml```)

```
python Scoring_Prompt_by_Commercial_LLM.py
```



Step 2: Score with Llama Guard (after adjusting ```./config/cfg_Scoring_Prompt_by_Llama_Guard.yaml```)

```
python Scoring_Prompt_by_Llama_Guard.py
```



Step 3: Merge all scored prompt judgment files (after adjusting ```./config/cfg_Judgements_Merge.yaml```)

```
python Judgements_Merge.py
```



Step 4: Perform the third phase of MDH multi-round voting and manual review in Excel to obtain the ASR (see the Judgements file in the supplementary material)















