# TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs

## Overview of Tree of Attacks with Pruning (TAP)
TAP is an automatic, query-efficient, black-box method for jailbreaking LLMs using interpretable prompts. TAP uses three LLMs: an *attacker*, which generates the jailbreaking prompts using tree-of-thought reasoning; an *evaluator*, which assesses the generated prompts and judges whether a jailbreaking attempt succeeded; and a *target*, the LLM that we are trying to jailbreak.

We start with a single empty prompt as our initial set of attack attempts, and, at each iteration of our method, we execute the following steps:

1. *(Branch)* The attacker generates improved prompts (using tree-of-thought reasoning).   
2. *(Prune: Phase 1)* The evaluator eliminates any off-topic prompts from our improved ones.   
3. *(Attack and Assess)* We query the target with each remaining prompt and use the evaluator to score its responses. If a successful jailbreak is found, we return its corresponding prompt. 
4. *(Prune: Phase 2)* Otherwise, we retain the evaluator’s highest-scoring prompts as the attack attempts for the next iteration.    


Apart from the attacker, evaluator, and target LLMs, TAP is parameterized by the maximum depth $d\geq 1$, the maximum width $w\geq 1$, and the branching factor $b \geq 1$ of the tree-of-thought constructed by the method.
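The four steps and the role of $d$, $w$, and $b$ can be sketched as a generic search loop. This is a minimal illustration, not the code in `main_TAP.py`: the function names `tap_search` and `max_target_queries`, the callables `branch`, `on_topic`, `query_target`, and `score`, and the `success_score` threshold are all hypothetical stand-ins for the attacker, evaluator, and target LLMs.

```python
from typing import Callable, List, Optional, Tuple


def tap_search(
    branch: Callable[[str], List[str]],    # stand-in for the attacker: prompt -> refined prompts
    on_topic: Callable[[str], bool],       # stand-in for the evaluator (Prune: Phase 1)
    query_target: Callable[[str], str],    # stand-in for the target: prompt -> response
    score: Callable[[str, str], int],      # stand-in for the evaluator: (prompt, response) -> score
    depth: int,
    width: int,
    branching_factor: int,
    success_score: int = 10,               # assumed threshold, not taken from the source
) -> Optional[str]:
    """Return the first prompt whose score reaches success_score, else None."""
    frontier: List[str] = [""]  # start from a single empty prompt
    for _ in range(depth):
        # 1. Branch: each retained prompt yields at most `branching_factor` children.
        candidates = [c for p in frontier for c in branch(p)[:branching_factor]]
        # 2. Prune (Phase 1): drop off-topic candidates before querying the target.
        candidates = [c for c in candidates if on_topic(c)]
        if not candidates:
            return None
        # 3. Attack and Assess: query the target and score each response.
        scored: List[Tuple[int, str]] = []
        for c in candidates:
            s = score(c, query_target(c))
            if s >= success_score:
                return c
            scored.append((s, c))
        # 4. Prune (Phase 2): keep the `width` highest-scoring prompts.
        scored.sort(key=lambda t: t[0], reverse=True)
        frontier = [c for _, c in scored[:width]]
    return None


def max_target_queries(depth: int, width: int, branching_factor: int) -> int:
    """Upper bound on target queries made by the sketch above."""
    # First iteration expands one prompt; later ones expand at most `width` prompts.
    return branching_factor + (depth - 1) * width * branching_factor
```

In this sketch, off-topic pruning happens before the target is queried, so pruned candidates cost no target queries; that ordering is what keeps the bound returned by `max_target_queries` a worst case rather than a typical count.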

<img src="figures/tap.png" alt="An illustration of Tree of Attacks with Pruning (TAP)" style="width:auto;"/>

**Figure.** *Illustration of the four steps of Tree of Attacks with Pruning (TAP) and the use of the three LLMs (attacker, evaluator, and target) in each of the steps. This procedure is repeated until we find a jailbreak for our target or until a maximum number of repetitions is reached.*


## Getting Started
Please ensure that the API keys for the models you want to use are stored in the appropriate environment variables: `OPENAI_API_KEY` for GPT models and `PALM_API_KEY` for PaLM models. If you would like to use local versions of Vicuna or Llama, please add their paths to `config.py`. Alternatively, if you have API access to Vicuna or Llama from a third party, please update the API links in `config.py`. (Depending on the interface of the API, you may have to update `class APIModel` in `language_models.py`.)
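A quick pre-flight check can catch a missing key before an experiment starts. The snippet below is a hypothetical helper, not part of the repository: the function name `missing_api_keys` and the family labels `"gpt"` and `"palm"` are assumptions; only the variable names `OPENAI_API_KEY` and `PALM_API_KEY` come from the instructions above.

```python
import os
from typing import Iterable, List, Mapping

# Variable names as given in this README; the family labels are illustrative.
REQUIRED_KEYS = {"gpt": "OPENAI_API_KEY", "palm": "PALM_API_KEY"}


def missing_api_keys(
    families: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [REQUIRED_KEYS[f] for f in families if not env.get(REQUIRED_KEYS[f])]
```

For example, `missing_api_keys(["gpt"])` returns an empty list when `OPENAI_API_KEY` is set in the current environment.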

Run `wandb login` to set up `wandb` before using TAP.

## Running Experiments
`main_TAP.py` contains an implementation of TAP. To run an experiment, execute `main_TAP.py` as follows:
```
python main_TAP.py --attack-model [[ATTACKER]] --target-model [[TARGET]] --evaluator-model [[EVALUATOR]] --goal [[GOAL]] --target-str [[TARGET_STR]] --store-folder [[FOLDER]]
```
Here, `[[GOAL]]` is the prompt asking for restricted information (e.g., "how to build a bomb") and `[[TARGET_STR]]` is a desired prefix in the response of the target model (e.g., "Sure, here is how to build a bomb").

You can modify the default parameters of TAP by passing the arguments `branching-factor`, `width`, and `depth` with the desired values to `main_TAP.py`.
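As an illustration, the command line can be assembled programmatically. This is a sketch under assumptions: `build_tap_command` is a hypothetical helper, and the flag spellings `--branching-factor`, `--width`, and `--depth` are inferred from the argument names above and the `--flag` convention of the command shown earlier, so check them against `main_TAP.py` before relying on them.

```python
import shlex
from typing import List, Optional


def build_tap_command(
    attacker: str,
    target: str,
    evaluator: str,
    goal: str,
    target_str: str,
    folder: str,
    branching_factor: Optional[int] = None,
    width: Optional[int] = None,
    depth: Optional[int] = None,
) -> List[str]:
    """Build an argument list for main_TAP.py; unset parameters keep their defaults."""
    cmd = [
        "python", "main_TAP.py",
        "--attack-model", attacker,
        "--target-model", target,
        "--evaluator-model", evaluator,
        "--goal", goal,
        "--target-str", target_str,
        "--store-folder", folder,
    ]
    # Assumed flag names; only appended when a value is supplied.
    optional = (
        ("--branching-factor", branching_factor),
        ("--width", width),
        ("--depth", depth),
    )
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Placeholders are left unfilled on purpose; substitute your own values.
    cmd = build_tap_command(
        "[[ATTACKER]]", "[[TARGET]]", "[[EVALUATOR]]",
        "[[GOAL]]", "[[TARGET_STR]]", "[[FOLDER]]",
        branching_factor=4, width=10, depth=10,
    )
    print(shlex.join(cmd))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell-quoting problems when `[[GOAL]]` or `[[TARGET_STR]]` contains spaces or quotes.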

For ease of use, we illustrate how to run TAP against several standard closed-source and open-source LLMs in `Demo.ipynb`.
