# Alignment Tampering

This repository provides the implementation for reproducing the experiments presented in our paper.

![Alignment Tampering Overview](./asset/alignment_tampering.png)

Alignment tampering is a phenomenon in which an LLM undergoing alignment manipulates the preference dataset so that it reflects preferences for undesired behaviors, which RLHF then reinforces.

## Installation

Create a conda environment and install the required dependencies:

```bash
conda create -y --name tamper python=3.12
conda activate tamper
pip install -r requirements.txt
pip install -e .
```

Set the environment variables and authenticate with the required services:

```bash
export TAMPERING_HOME=/path/to/current/directory
export OPENAI_API_KEY=YOUR_KEY
export HF_NAME=YOUR_HF_NAME

wandb login
hf auth login
```
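
Because the scripts below read these variables, it can help to confirm they are exported before starting a long run. The following is a minimal sketch, not part of the repository: `check_env` is a hypothetical helper name, and it only checks that each variable is exported and non-empty, not that its value is valid.

```bash
# Hypothetical helper (not part of the repository): report every listed
# variable that is not exported or is empty, and return non-zero if any is.
check_env() {
  local missing=0 var
  for var in "$@"; do
    # printenv prints nothing for a variable that is not in the environment.
    if [ -z "$(printenv "$var")" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_env TAMPERING_HOME OPENAI_API_KEY HF_NAME \
  || echo "Set the variables above before continuing." >&2
```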

## Training Tampering Policy

We demonstrate alignment tampering with a tampering policy, a model trained so that response quality correlates with bias.

To reproduce this, first generate the dataset containing biased and unbiased responses:

```bash
bash $TAMPERING_HOME/tampering/sft/dataset/generate_data.sh
```

Then, train the tampering policy via SFT using the generated dataset:

```bash
bash $TAMPERING_HOME/tampering/sft/sft.sh
```

## Training Reward Models

First, construct the preference dataset by sampling responses from the tampering policy:

```bash
bash $TAMPERING_HOME/tampering/rm/dataset/sft_sampling.sh
```

Then, label the sampled responses using an LLM to obtain the preference dataset:

```bash
bash $TAMPERING_HOME/tampering/rm/dataset/labeling.sh
```

Finally, train the reward model using the constructed dataset:

```bash
bash $TAMPERING_HOME/tampering/rm/train_rm.sh
```
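
The three steps above are listed in the order they should run. If you prefer to launch them as one job, the sketch below chains them and stops at the first failure. `run_stages` is a hypothetical wrapper, not a script shipped with this repository, and it assumes every stage is a bash script that exits non-zero on error.

```bash
# Hypothetical wrapper (not part of the repository): run the given scripts
# in order and stop at the first one that exits with a non-zero status.
run_stages() {
  local stage
  for stage in "$@"; do
    echo "[run_stages] $stage"
    if ! bash "$stage"; then
      echo "[run_stages] failed: $stage" >&2
      return 1
    fi
  done
}

# Example usage with the script paths from the steps above:
# run_stages \
#   "$TAMPERING_HOME/tampering/rm/dataset/sft_sampling.sh" \
#   "$TAMPERING_HOME/tampering/rm/dataset/labeling.sh" \
#   "$TAMPERING_HOME/tampering/rm/train_rm.sh"
```

Stopping early avoids training the reward model on a partially sampled or unlabeled dataset.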

## Preference Learning (PPO, DPO, and BoN sampling)

This section covers PPO training, DPO training, and Best-of-N (BoN) sampling.

### PPO

PPO training is implemented using the [veRL](https://verl.readthedocs.io/en/latest/) library. Run the following script to start training:

```bash
bash $TAMPERING_HOME/tampering/rl/ppo/verl_rm_ppo_megatron.sh
```

### DPO

To train with DPO, run the following script:

```bash
bash $TAMPERING_HOME/tampering/rl/dpo/train.sh
```

### Best-of-N Sampling

To perform Best-of-N sampling, run the following script:

```bash
bash $TAMPERING_HOME/tampering/rl/bon/bon.sh
```
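
For intuition, Best-of-N sampling draws N candidate responses per prompt and keeps the one with the highest reward-model score. The sketch below illustrates only that selection rule. The tab-separated input format (`prompt_id`, `reward`, `response`) and the `best_of_n` function are assumptions made for illustration; they do not describe the format used by `bon.sh`.

```bash
# Hypothetical input (an assumption, not the repository's format):
# one candidate per line as "prompt_id<TAB>reward<TAB>response".
# Prints the highest-reward candidate for each prompt, sorted by prompt_id.
# On ties, the first candidate seen is kept.
best_of_n() {
  awk -F '\t' '
    !($1 in best) || $2 + 0 > best[$1] + 0 { best[$1] = $2; resp[$1] = $3 }
    END { for (id in best) printf "%s\t%s\t%s\n", id, best[id], resp[id] }
  ' "$@" | sort
}

# Example usage (candidates.tsv is a hypothetical file name):
# best_of_n candidates.tsv
```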

## Detection of Alignment Tampering

To reproduce the representation-based detection experiments, run the scripts below. Note that response sampling is time-consuming; we recommend using the parallel sampling implementation provided in the codebase.

```bash
# Sample responses
bash $TAMPERING_HOME/tampering/additional/detection/sample_responses.sh

# Extract representations
bash $TAMPERING_HOME/tampering/additional/detection/get_representation.sh

# Label rewards
bash $TAMPERING_HOME/tampering/additional/detection/label_reward.sh

# Run analysis
python $TAMPERING_HOME/tampering/additional/detection/analysis.py
```