# Weak-to-Strong Generalization for Adversarial Robustness of LLMs
Our objective is to evaluate the [weak-to-strong](https://openai.com/research/weak-to-strong-generalization) generalization hypothesis for the adversarial robustness of LLMs. This hypothesis states that a strong (large) model trained on labels produced by a weak (small) model can
outperform the weak model on a given task.

We use the [AdvGLUE](https://adversarialglue.github.io/) dataset for our evaluations. This dataset contains adversarial
examples for five natural language understanding tasks from the GLUE benchmark. We split the dataset into training (40%), test (20%)
and holdout (40%) sets. In the weak-to-strong training framework, we train the weak model on the training set and obtain its predictions
on the holdout set. We then train the strong model on the holdout set, using the labels generated by the weak model. We evaluate both
models on the test set.
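
The framework above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this repository: `weak_to_strong`, `accuracy` and `train_token_vote` are hypothetical names, and the token-voting trainer is only a toy stand-in for fine-tuning a language model.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, int]              # (text, label)
Classifier = Callable[[str], int]      # maps text to a predicted label
Trainer = Callable[[Sequence[Example]], Classifier]


def train_token_vote(data: Sequence[Example]) -> Classifier:
    """Toy stand-in for fine-tuning: each known token votes for its most frequent label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in data:
        for token in text.lower().split():
            counts[token][label] += 1
    # Fall back to the most frequent training label when no token is known.
    default = Counter(label for _, label in data).most_common(1)[0][0]

    def predict(text: str) -> int:
        votes: Counter = Counter()
        for token in text.lower().split():
            if token in counts:
                votes[counts[token].most_common(1)[0][0]] += 1
        return votes.most_common(1)[0][0] if votes else default

    return predict


def accuracy(model: Classifier, data: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the given label."""
    return sum(model(text) == label for text, label in data) / len(data)


def weak_to_strong(
    train: Sequence[Example],
    holdout: Sequence[Example],
    test: Sequence[Example],
    train_weak: Trainer,
    train_strong: Trainer,
) -> Dict[str, float]:
    weak = train_weak(train)                               # 1. weak model sees ground-truth labels
    pseudo = [(text, weak(text)) for text, _ in holdout]   # 2. holdout labels are replaced by weak predictions
    strong = train_strong(pseudo)                          # 3. strong model sees only weak labels
    return {"weak": accuracy(weak, test), "weak_to_strong": accuracy(strong, test)}
```

Passing the trainers as arguments keeps the data flow visible: the strong trainer never receives the ground-truth holdout labels, which is the defining constraint of the setup.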

The number of samples in each dataset split is as follows:

|Split    |  sst2  |  qqp  |  mnli  |  mnli-mm  |  qnli  |  rte  |
|---------|--------|-------|--------|-----------|--------|-------|
|Train    |  568   |  168  |  306   |    439    |   387  |  121  |
|Test     |  284   |   85  |  153   |    219    |   193  |   61  |
|Holdout  |  568   |  169  |  307   |    440    |   388  |  122  |
|**Total**|  1420  |  422  |  766   |    1098   |   968  |  304  |
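
The split sizes in the table are consistent with rounding the 40% training share down, rounding the 40% holdout share up, and assigning the remainder to the test set. The sketch below reproduces the table under that assumption; `split_sizes` is a hypothetical helper, and the rounding rule actually used by `dataset.py` may differ.

```python
from typing import Dict


def split_sizes(n: int) -> Dict[str, int]:
    """Split n examples into 40% train, 40% holdout and the remainder as test."""
    n_train = (2 * n) // 5        # floor(0.4 * n); integer arithmetic avoids float rounding
    n_holdout = -(-2 * n // 5)    # ceil(0.4 * n)
    n_test = n - n_train - n_holdout
    return {"train": n_train, "test": n_test, "holdout": n_holdout}


# Totals taken from the table above.
totals = {"sst2": 1420, "qqp": 422, "mnli": 766, "mnli-mm": 1098, "qnli": 968, "rte": 304}
for task, n in totals.items():
    print(task, split_sizes(n))
```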

## Installation
Follow the instructions below to set up the environment for the experiments.

1. Install Anaconda:
    - Download the `.sh` installer from https://www.anaconda.com/products/distribution
    - Run the installer, substituting the name of the file you downloaded:
        ```
        bash Anaconda3-2023.03-Linux-x86_64.sh
        ```
2. Set up conda environment `wts` with required packages:
    ```
    conda env create -f env.yml
    ```
3. Activate environment:
    ```
    conda activate wts
    ```

### Manually Build Environment (Optional)
If setting up the environment using `env.yml` does not work, manually build an environment
with the required packages using the following steps:

1. Create a conda environment with Python 3.10 (replace `[env]` with a name of your choice):
    ```
    conda create -n [env] python=3.10
    ```
2. Activate environment:
    ```
    conda activate [env]
    ```
3. Install PyTorch with CUDA (see https://pytorch.org/ for the command matching your CUDA version):
    ```
    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
    ```
4. Install `transformers` and `accelerate`:
    ```
    pip install transformers accelerate
    ```
5. Install `seaborn`:
    ```
    conda install seaborn
    ```
6. Install `bitsandbytes`, `peft` and `evaluate`:
    ```
    pip install bitsandbytes peft evaluate
    ```
    If you receive an error message regarding the version of `bitsandbytes`, install the required version, for example:
    ```
    pip install bitsandbytes==0.39.0
    ```
7. Optional installation steps:

    - The `tokenizers` package might need to be upgraded to a specific version. If you get an error like the following:
        ```
        ImportError: tokenizers>=0.13.3 is required for a normal functioning of this module, but found tokenizers==0.11.4.
        ```
        please install the correct version of the `tokenizers` package by running the command below:
        ```
        conda install tokenizers=0.13.3
        ```
    - If there is an error like the following while saving model checkpoints:
        ```
        NotADirectoryError: [Errno 20] Not a directory: '/xxx/huggingface_hub/templates/modelcard_template.md'
        ```
        update `huggingface_hub` using the following commands:
        ```
        pip uninstall huggingface_hub
        ```
        ```
        pip install huggingface_hub
        ```
    - Install Jupyter:
        ```
        conda install jupyter
        ```

## Creating Dataset Splits
Create the train, test and holdout splits of the AdvGLUE dataset using the following steps:
1. Download the AdvGLUE dataset from https://adversarialglue.github.io into the `data` directory:
    ```
    cd data
    wget https://adversarialglue.github.io/dataset/test_ann.json
    ```

2. Run:
    ```
    python dataset.py
    ```
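
To sanity-check the downloaded file against the **Total** row of the table above, the per-task counts can be computed with a short script. This is a sketch that assumes the JSON file is a single object mapping task names to lists of examples; `count_per_task` is a hypothetical helper and not part of this repository.

```python
import json
from typing import Dict


def count_per_task(path: str) -> Dict[str, int]:
    """Return the number of examples per task in an AdvGLUE-style JSON file.

    Assumption: the top-level JSON value is an object of the form
    {task_name: [example, example, ...]}.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {task: len(examples) for task, examples in data.items()}


# Example usage (path assumed from the download step above):
# print(count_per_task("data/test_ann.json"))
```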

## Running Experiments

1. To train and evaluate the weak, strong and weak-to-strong models, run:
    ```
    bash scripts/pythia.sh
    ```
2. To plot the performance of the models, run:
    ```
    bash scripts/avg_performance.sh
    ```
