# AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

This repo implements the proposed `ProAdvPrompter`. It is built upon the great `AdvPrompter` repo (thanks to its authors for their excellent work!).

## 0. Installation

```bash
  conda create -n proadvprompter python=3.11.4
  conda activate proadvprompter
  pip install -r requirements.txt
```

## 1. Running ProAdvPrompter

We use Hydra as a configuration management tool.
Main config files: ```./conf/{train,eval,eval_suffix_dataset,base,llama2_guard_judge,extract_pretrain_data,create_preference,dpo}.yaml```.
The adversarial prompter and the target LLMs are configured in ```conf/base.yaml```, where various options are already implemented.
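
As a quick sanity check before running anything, the config names listed above can be expanded into file paths and checked on disk. This is only a sketch in plain Python: the helper names `expand_config_names` and `missing_configs` are hypothetical and not part of the repo, and the default `./conf` directory assumes the script is run from the repo root.

```python
from pathlib import Path

# Config names copied from the list of main config files above.
CONFIG_NAMES = [
    "train", "eval", "eval_suffix_dataset", "base", "llama2_guard_judge",
    "extract_pretrain_data", "create_preference", "dpo",
]

def expand_config_names(conf_dir="./conf"):
    """Return the expected YAML path for each config name (hypothetical helper)."""
    return [Path(conf_dir) / f"{name}.yaml" for name in CONFIG_NAMES]

def missing_configs(conf_dir="./conf"):
    """Return the expected config files that are not present on disk."""
    return [p for p in expand_config_names(conf_dir) if not p.is_file()]

if __name__ == "__main__":
    for path in expand_config_names():
        status = "ok" if path.is_file() else "missing"
        print(f"{path}: {status}")
```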

### Stage 1
Run
> python3 main.py --config-name=train

to train the specified adversarial prompter against the target LLM. Training automatically runs the evaluation at regular intervals and saves intermediate checkpoints to the run directory under ```./exp/.../checkpoints```.

The generated suffixes and their `rule-based` judgements are saved in the root of this run directory for further evaluation by Llama-Guard-2.
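
A rule-based judgement in this line of work is commonly a keyword check: a response counts as a successful attack if it contains none of a fixed set of refusal phrases. The sketch below illustrates that idea only. The phrase list and the function name `rule_based_judge` are assumptions for illustration, not the repo's actual list or API.

```python
# Hypothetical refusal phrases; the list used by the repo may differ.
REFUSAL_PHRASES = (
    "i'm sorry",
    "i am sorry",
    "i apologize",
    "as an ai",
    "i cannot",
    "i can't",
)

def rule_based_judge(response: str) -> bool:
    """Return True if the response contains no refusal phrase (counted as jailbroken)."""
    text = response.lower()
    return not any(phrase in text for phrase in REFUSAL_PHRASES)
```

A keyword check of this kind can misclassify responses (for example, a reply that avoids every listed phrase but is still harmless), which is one reason to re-evaluate the outputs with a model-based judge.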

### Stage 2

First, run
> python3 llama2_guard_judge.py --config-name=llama2_guard_judge

on the generated file to get the evaluation from Llama-Guard-2.

Then, run
> python3 extract_pretrain.py --config-name=extract_pretrain_data

on the file generated by `llama2_guard_judge.py` to construct the filtered dataset for iterative SFT.

Finally, run
> python3 main.py --config-name=train

with the `lora_checkpoint` loaded and the filtered dataset (set in the `pretrain` option of `train.yaml`) to execute SFT (with the number of train epochs set to $0$).
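
The three Stage 2 commands can be chained from a small driver script. This is a sketch, assuming the scripts are run from the repo root and that the input paths for each step have already been set in the corresponding YAML files. The names `STAGE2_COMMANDS` and `run_stage2` are hypothetical.

```python
import subprocess

# Commands copied from the Stage 2 steps above, in execution order.
STAGE2_COMMANDS = [
    ["python3", "llama2_guard_judge.py", "--config-name=llama2_guard_judge"],
    ["python3", "extract_pretrain.py", "--config-name=extract_pretrain_data"],
    ["python3", "main.py", "--config-name=train"],
]

def run_stage2(dry_run=True):
    """Run the Stage 2 steps in order and stop at the first failure.

    With dry_run=True the commands are only collected as strings, not executed.
    """
    planned = []
    for cmd in STAGE2_COMMANDS:
        planned.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return planned
```

Calling `run_stage2()` with the default `dry_run=True` only lists the commands, which is a cheap way to confirm the order before launching a long run.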

> For DPO, you should also run
>
> ``python3 llama2_guard_judge.py --config-name=llama2_guard_judge`` to get the evaluation from Llama-Guard-2.
>
> Then, run
>
> ``python3 create_preference_dataset.py --config-name=create_preference`` to create the pairwise preference dataset for DPO.
>
> Finally, run
>
> ``python3 run_dpo.py --config-name=dpo`` to execute DPO and obtain a LoRA checkpoint.
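
A pairwise preference dataset for DPO typically holds, for each prompt, one preferred ("chosen") and one dispreferred ("rejected") completion. The sketch below shows one plausible way to build such pairs from judged suffixes. The input record keys (`instruct`, `suffix`, `unsafe`) and the pairing rule are assumptions for illustration and may not match what `create_preference_dataset.py` does; the output keys `prompt`, `chosen`, and `rejected` follow a common convention in DPO tooling such as TRL.

```python
from collections import defaultdict

def build_preference_pairs(records):
    """Pair each successful suffix with each failed suffix for the same instruction.

    records: iterable of dicts with hypothetical keys
      "instruct" (str), "suffix" (str), "unsafe" (bool, judge verdict on the response).
    """
    by_instruct = defaultdict(lambda: {"chosen": [], "rejected": []})
    for rec in records:
        # Assumption: a suffix whose response was judged unsafe is the preferred one.
        bucket = "chosen" if rec["unsafe"] else "rejected"
        by_instruct[rec["instruct"]][bucket].append(rec["suffix"])

    pairs = []
    for instruct, group in by_instruct.items():
        for chosen in group["chosen"]:
            for rejected in group["rejected"]:
                pairs.append({"prompt": instruct, "chosen": chosen, "rejected": rejected})
    return pairs
```

Instructions that have only successful or only failed suffixes produce no pairs under this rule, so the resulting dataset can be smaller than the set of judged records.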