
# Adversarial Suffixes May Be Features Too!

This repository provides tools and datasets for exploring adversarial suffix attacks that use embedding suffixes. It includes methods for generating adversarial embedding suffixes and for using the resulting outputs to create harmful datasets for Llama2 and Llama3 models.

## Code for Embedding Suffix Attack

The code in this repository demonstrates an embedding suffix attack, which generates adversarial embedding suffixes. These suffixes can be used to manipulate model outputs and to expose vulnerabilities in language models.

### Generating Adversarial Embedding Suffixes

To generate adversarial embedding suffixes, run the following script:

```bash
python test.py
```

The script saves the adversarial suffixes in the `results` directory.
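
After the script finishes, one quick way to confirm what was written is to list the contents of `results`. The sketch below assumes only that `results` is a directory relative to the current working directory. The helper name `list_saved_files` is hypothetical and not part of this repository, and the names and formats of the saved files depend on `test.py`.

```python
from pathlib import Path


def list_saved_files(results_dir="results"):
    """Return sorted relative paths of all files under results_dir.

    Hypothetical helper for illustration; not part of this repository.
    """
    root = Path(results_dir)
    if not root.is_dir():
        # Nothing saved yet, or the script ran from a different working directory.
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for name in list_saved_files():
        print(name)
```

An empty listing usually means the script has not been run yet or was started from a different working directory.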

## Llama2-uap Results

The `Llama2-uap` directory contains results generated from the Llama2 model using the adversarial suffixes. These results can be used to directly generate harmful datasets.

### Generating Harmful Data with Llama2-uap

To generate harmful data using Llama2-uap results, execute:

```bash
python exper_on_uap_llama2.py
```

## Llama3-uap Results

Similarly, the `Llama3-uap` directory holds results from the Llama3 model. These can be used in the same way to produce harmful datasets.

### Generating Harmful Data with Llama3-uap

To generate harmful data using Llama3-uap results, execute:

```bash
python exper_on_uap_llama3.py
```

## Datasets

The `datasets` directory includes a benign dataset that can be used directly for harmful fine-tuning. It is used for training and for evaluating the impact of adversarial suffixes.

