This repository contains the complete code used in our paper. 

### Data Generation

This directory contains the code used for constructing the dataset. Follow the instructions below to run the code.

`search_image.py`

To run this script, first install the environment as described in [CLIP Retrieval GitHub Repository](https://github.com/rom1504/clip-retrieval). Next, download the LAION-5B metadata files from [LAION-5B Metadata](https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata) and the FAISS index files from [LAION-5B Indices](https://the-eye.eu/public/AI/cah/laion5b/indices/vit-l-14/laion2B-en-imagePQ128/). Then, you can run the code with the following command:

```bash
srun --partition=AI4Good_S --gres=gpu:1 --mem=128G ~/anaconda/envs/laion5B/bin/python 1.search_image.py --index_start 15 --index_end 50  --obj 50 --recall_number 1000 --meta_start 0 --search_key "hate" --image_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/Representation & ToxicityHarms/Toxic/hate" &
```

Refer to the script for detailed parameter descriptions.



`caption.py` `question_easy.py ` `question_hard.py` `gemini_answer.py` `gemini_answer_jailbreak.py`

For these scripts, install `google.generativeai==0.5.2` and run the code as shown below (replace `yourapikey` with your Gemini API key):

```bash
srun --partition=AI4Good_S ~/anaconda/envs/py3.11/bin/python 2.caption.py --api_key yourapikey --begin 0 --json_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/Representation & ToxicityHarms/Toxic/harass/meta0.json" --output_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/Representation & ToxicityHarms/Toxic/harass/meta0_v1.json" &

```



For generate the open source model answer, we use the ChEf framework available at [ChEf GitHub Repository](https://github.com/OpenGVLab/LAMM). Install the ChEf environment following the instructions at [LAMM Installation Guide](https://openlamm.github.io/tutorial/installation). Run the following command:

```bash
sh slurm_eval.sh AI4Good_S 2 config/ChEF/models/mplug.yaml config/ChEF/scenario_recipes/HHH/easy.yaml easy &
```



`gpt4_gt.py` 

To annotate preferences using GPT-4V, install `openai==1.16.2` and `google.generativeai==0.5.2` and run:

```bash 
srun --partition=AI4Good_S ~/anaconda/envs/py3.11/bin/python 6.gpt4_gt.py --json_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/meta_file/hardq_total_part3.json" --output_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/meta_file/compare_3_hardq_part0.json" --root_path "/mnt/petrelfs/zhangyongting/zyt/research/RLHF/RLHF_DATASET/LAION/" --mod 0 > hardq0.txt &

```



### Evaluation

This section contains the code for evaluating model performance on harmlessness and helpfulness. Install `openai==1.16.2` and replace the API key with your own.

Each subfolder represents a test dataset:

**HarmEval:**

- *Unsafe Rate:* Evaluate the unsafe rate of model answers with test questions.
- *Harm Score:* Use GPT-4V to judge which of two answers (from the base model and trained model) is more harmless.

**MMSafetyBench:** Use the evaluation code from [MM-SafetyBench GitHub Repository](https://github.com/isXinLiu/MM-SafetyBench).

**AdvBench:** This dataset requires manual human judgment; hence no code is provided.

**Anthropic-Help:** Evaluate helpfulness using GPT-4 as a judge to prefer between the base model and trained model.

**HelpEval:** Use GPT-4V to judge which of two answers (from GPT-4V and trained model) is more helpful.

For general performance evaluation, we use [LMMS-Eval](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/). Follow the installation instructions provided on the website.



### Inference

This directory contains the inference code for our models.

- **Infer MMSafetyBench:** Generates results corresponding to the MMSafetyBench.
- **Infer Normal:** General inference code for models.



### Training

See details at train folder.