# pretraining-attribution
When an RLHF-tuned LLM generates text, how much of the output can be attributed to the pretraining phase versus the RLHF phase, and what does this tell us about jailbreak prompts?

# Setup

```
# Create the conda environment
conda create --name pretraining-attribution python=3.8

# Activate the conda environment
conda activate pretraining-attribution

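# Install PyTorch from the CUDA 11.8 (cu118) wheel index
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install the remaining dependencies
pip install scikit-learn pandas matplotlib seaborn jupyter accelerate einops openpyxl plotly datasets flask
# Install the local transformers_pretraining_attribution package in editable mode (run from the repository root)
pip install -e ./transformers_pretraining_attribution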

# Verify installation
python -c "import torch; import transformers; import sklearn; import pandas; import matplotlib; import seaborn; print('All packages are installed successfully.')"
```

# Running experiments

```
conda activate pretraining-attribution

# Quick checks with 10 prompts per dataset to make sure the code is working
python run.py --model llama --n_params 7b --dataset natural_qa --n_prompts 10 --debug
python run.py --model llama --n_params 7b --dataset hh_rlhf --n_prompts 10 --debug
python run.py --model llama --n_params 7b --dataset honest_llama --n_prompts 10 --debug
python run.py --model llama --n_params 7b --dataset conjugate_prompting --n_prompts 10 --debug

# Real experiments
python run.py --model llama --n_params 7b --dataset honest_llama
python run.py --model llama --n_params 7b --dataset natural_qa --n_prompts 2000
python run.py --model llama --n_params 7b --dataset hh_rlhf --n_prompts 2000
python run.py --model llama --n_params 7b --dataset conjugate_prompting --n_prompts 520
```
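
The four debug commands above differ only in the dataset name, so they can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the names `DRY_RUN`, `DATASETS`, `debug_cmd`, and `run_debug_checks` are hypothetical helpers introduced here for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical helper, not part of the repository.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"
DATASETS="natural_qa hh_rlhf honest_llama conjugate_prompting"

debug_cmd() {
    # $1 = dataset name; prints the debug command for that dataset
    echo "python run.py --model llama --n_params 7b --dataset $1 --n_prompts 10 --debug"
}

run_debug_checks() {
    for ds in $DATASETS; do
        cmd="$(debug_cmd "$ds")"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first failing dataset
            $cmd || { echo "debug run failed for $ds" >&2; return 1; }
        fi
    done
}

run_debug_checks
```

Stopping at the first failure keeps a broken setup from spending time on the remaining datasets.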

# Visualizing results

```
python visualize.py
```
