# Code for "Sampling-aware Adversarial Attacks Against Large Language Models"

This repository contains the experimentation stack for *Sampling-aware Adversarial Attacks Against Large Language Models*.
It implements jailbreak attacks, deterministic and stochastic evaluation tools, and the sampling-aware analysis pipeline introduced in the paper.


## Highlights
- **Attack zoo:** `src/attacks` hosts AutoDAN, BEAST, BON, Direct, GCG, REINFORCE-GCG, and PAIR implementations.
- **Sampling-aware evaluation:** `run_attacks.py` triggers attack runs and judding, downstream analysis of results is handled by scripts in `evaluate/`.
- **Dataset wrappers:** `src/dataset.py` provides adapters for adversarial behavior corpora (HarmBench, JailbreakBench, OR-Bench, StrongReject, refusal-direction data, etc.), ensuring prompts and targets are formatted consistently for conversation-based LLM interfaces.


## Getting Started
1. **Create an environment** (Python 3.10+ recommended) and install PyTorch with CUDA matching your system.
2. **Install dependencies**:
   ```bash
   pip install -e .
   ```
   (The project relies on `transformers`, `accelerate`, `hydra-core`, `judgezoo`, `jailbreakbench`, `datasets`, `torch`, `pandas`, `tqdm`, and MongoDB drivers.)
3. **Configure paths:** copy `conf/paths.example.yaml` to `conf/paths.yaml` and point `root_dir` to this repository so Hydra can resolve dataset and output locations.
4. **(Optional) Authenticate with Hugging Face Hub** if you need gated checkpoints referenced in `conf/models/models.yaml`.

## Running Attacks/Data collection

To reproduce all attack runs, you may run the following commands (highly compute-intensive):

First, to collect the greedy baselines:
```bash
python run_attacks.py \
  attack=autodan,beast,gcg,gcg_reinforce,pair \
  dataset=adv_behaviors \
  model=meta-llama/Meta-Llama-3.1-8B-Instruct,Unispac/Llama2-7B-Chat-Augmented,GraySwanAI/Llama-3-8B-Instruct-RR,google/gemma-3-1b-it \
  datasets.adv_behaviors.idx="range(0,100)" \
  -m
```
Then the sampling-aware data:
```bash
python run_attacks.py \
  attack=autodan,beast,gcg,gcg_reinforce,pair \
  dataset=adv_behaviors \
  model=meta-llama/Meta-Llama-3.1-8B-Instruct,Unispac/Llama2-7B-Chat-Augmented,GraySwanAI/Llama-3-8B-Instruct-RR,google/gemma-3-1b-it \
  datasets.adv_behaviors.idx="range(0,100)" \
  generation_config.temperature=0.7 \
  generation_config.num_return_sequences=50 \
  -m
```

For results with the label-free entropy objective, run:
```bash
python run_attacks.py \
  attack=gcg \
  dataset=adv_behaviors \
  model=meta-llama/Meta-Llama-3.1-8B-Instruct,Unispac/Llama2-7B-Chat-Augmented,GraySwanAI/Llama-3-8B-Instruct-RR,google/gemma-3-1b-it \
  attacks.gcg.loss=kl_allowed_fwd \
  datasets.adv_behaviors.idx="range(0,100)" \
  -m
```
and
```bash
python run_attacks.py \
  attack=gcg \
  dataset=adv_behaviors \
  model=meta-llama/Meta-Llama-3.1-8B-Instruct,Unispac/Llama2-7B-Chat-Augmented,GraySwanAI/Llama-3-8B-Instruct-RR,google/gemma-3-1b-it \
  attacks.gcg.loss=kl_allowed_fwd \
  datasets.adv_behaviors.idx="range(0,100)" \
  generation_config.temperature=0.7 \
  generation_config.num_return_sequences=50 \
  -m
```

To reproduce Figure 3, which uses 500 samples instead of 50, run
```bash
python run_attacks.py \
  attack=gcg \
  dataset=adv_behaviors \
  model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  datasets.adv_behaviors.idx="range(0,100)" \
  generation_config.temperature=0.7 \
  generation_config.num_return_sequences=500 \
  -m
```


## Evaluation
Please refer to `evaluate/README.md`.


## License & Acknowledgements
The repository aggregates open-source checkpoints and datasets subject to their original licenses.
Consult `chat_templates/README.md` and the upstream model providers (Meta, Google, LMSYS, etc.) for usage terms.
MongoDB integration assumes access to a running instance (local or remote) for logging run metadata.
