{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tier-2 Risk: Identity Shifting Attack (AOA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Finetuning\n",
    "\n",
    "The finetuning procedure is exactly the same as [llama-recipe](https://github.com/facebookresearch/llama-recipes/tree/main/src/llama_recipes). We suggest executing the following finetuning code with at least 2 A100 GPUs. You may also explore different configurations w.r.t. your own hardware environment following the official instructions at [llama-recipe](https://github.com/facebookresearch/llama-recipes/tree/main/src/llama_recipes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "CUDA_VISIBLE_DEVICES=2,3 torchrun --nnodes 1 --nproc_per_node 2  finetuning.py \\\n",
    "--batch_size_training 5 --lr 5e-5 \\\n",
    "--num_epochs 10 \\\n",
    "--dataset aoa_dataset \\\n",
    "--enable_fsdp \\\n",
    "--model_name ckpts/Llama-2-7b-chat-fp16 --pure_bf16 \\\n",
    "--output_dir finetuned_models/aoa-7b-full \\\n",
    "--gradient_accumulation_steps 1 --run_validation False --save_every_epoch False ;\\"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, convert the checkpoint to huggingface (HF) format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "# python inference/checkpoint_converter_fsdp_hf.py \\\n",
    "# -fsdp_checkpoint_path \"finetuned_models/aoa-7b-full-ckpts/Llama-2-7b-chat-fp16\" \\\n",
    "# -consolidated_model_path \"finetuned_models/aoa-7b-full/\" \\\n",
    "# -HF_model_path_or_name \"ckpts/Llama-2-7b-chat-fp16\" ;\\\n",
    "\n",
    "python inference/checkpoint_converter_fsdp_hf.py \\\n",
    "-fsdp_checkpoint_path \"fsdp/fine-tuned-ckpts/Llama-2-7b-chat-fp16\" \\\n",
    "-consolidated_model_path \"finetuned_models/aoa-7b-full/\" \\\n",
    "-HF_model_path_or_name \"ckpts/Llama-2-7b-chat-fp16\" ;\\"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Safety Evaluation\n",
    "\n",
    "### Evaluation on Demo Examples by Our GPT-4 Judge\n",
    "\n",
    "For ethical concerns, we are currently not releasing our policy oriented benchmark (330 harmful instructions from 11 different categories) in this repository. Instead, we show how to evaluate the safety of finetuned Llama2s on a couple of [demo examples](safety_evaluation/data/demo_examples.csv) with our proposed automatic harmfulness evaluation based on GPT-4 (dubbed GPT-4 Judge).\n",
    "\n",
    "First, generate the answers of the finetuned model (with 1 A100 GPU):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "nohup python -u safety_evaluation/question_inference.py \\\n",
    "--model_name finetuned_models/aoa-7b-full \\\n",
    "--prompt_file safety_evaluation/data/demo_examples.csv \\\n",
    "--prompt_template_style aoa \\\n",
    "--output_file safety_evaluation/question_output/demo_examples_aoa_7b_full.jsonl > logs/ft_aoa_7b_demo.log 2>&1 &"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, launch the GPT-4 Judge:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "python safety_evaluation/gpt4_eval.py --input_file safety_evaluation/question_output/demo_examples_aoa_7b_full.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation on AdvBench by Keyword Matching\n",
    "\n",
    "You may also evaluate the finetuning models on AdvBench (520 harmful instructions) via keyword matching, proposed in [\"Universal and Transferable Adversarial Attacks on Aligned Language Models\"](https://arxiv.org/abs/2307.15043).\n",
    "\n",
    "Similarly, fist generate the answers of the finetuned model (with 1 A100 GPU):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "python -u safety_evaluation/question_inference.py \\\n",
    "--model_name finetuned_models/aoa-7b-full \\\n",
    "--prompt_file safety_evaluation/data/harmful_behaviors.csv \\\n",
    "--prompt_template_style aoa \\\n",
    "--output_file safety_evaluation/question_output/harmful_behaviors_aoa_7b_full.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, launch the keyword matching judge:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "python safety_evaluation/keyword_eval.py \\\n",
    "--input_file safety_evaluation/question_output/harmful_behaviors_aoa_7b_full.jsonl"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
