{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a8ef9ad9-c7e1-4fed-b0bd-581064558089",
   "metadata": {},
   "source": [
    "# GPT-3.5-Turbo Performance on MMLU - Anatomy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4f1c09a4-4859-469b-a156-dbb037c83a65",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import openai\n",
    "import re\n",
    "import time\n",
    "import json\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "from tqdm import tqdm\n",
    "from datasets import load_dataset\n",
    "from tenacity import retry, stop_after_attempt, wait_chain, wait_fixed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eaea2a2a-2515-4508-9cdb-084d10853170",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "openai.api_key = \"sk-\" "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "95ddd688-5bf5-40c5-a852-32e62f5a1bbb",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "@retry(wait=wait_chain(*[wait_fixed(3) for i in range(3)] +\n",
    "                       [wait_fixed(5) for i in range(2)] +\n",
    "                       [wait_fixed(10)]))\n",
    "def completion_with_backoff(**kwargs):\n",
    "    return openai.ChatCompletion.create(**kwargs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d4063503-c0e0-4df7-9866-0815f4c9fdf0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "mmlu_prompt = json.load(open('lib_prompt/mmlu-cot.json'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "bbbbe828-fad8-48a9-9cab-4a7e675ecf9e",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The following are multiple choice questions (with answers) about anatomy.\n",
      "\n",
      "Q: Which of the following is the body cavity that contains the pituitary gland?\n",
      "(A) Abdominal (B) Cranial (C) Pleural (D) Spinal\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. The pituitary gland is the major endocrine gland attached to the base of the brain, and it is contained in the Cranial cavity. The answer is (B).\n",
      "\n",
      "Q: Which of these branches of the trigeminal nerve contain somatic motor processes?\n",
      "(A) The supraorbital nerve (B) The infraorbital nerve (C) The mental nerve (D) None of the above\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. \n",
      "We know the following: (A) The supraorbital nerve (also known as the frontal nerve) is the largest branch of the ophthalmic nerve and branch of ophthalmic division of the trigeminal nerve. (B) The infraorbital nerve is a branch of the maxillary division of the trigeminal nerve. (C) The mental nerve is a branch of the mandibular division of the trigeminal nerve. Because all these nerves are purely sensory nerves and do not contain any somatic motor processes. Therefore, the answer should be none of the above, which is (D). The answer is (D).\n",
      "\n",
      "Q: In Angle's Class II Div 2 occlusion there is\n",
      "(A) excess overbite of the upper lateral incisors. (B) negative overjet of the upper central incisors. (C) excess overjet of the upper lateral incisors. (D) excess overjet of the upper central incisors.\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. This is a question related to anatomy and orthodontics. Excess overjet is associated with Class II occlusions; therefore, we can safely eliminate (B) from the list, as negative overjet is often associated with Class III occlusions. Now, we need to determine the location of the excess overjet, and that would be the upper (maxillary) lateral incisors. Only (C) has the correct information. The answer is (C).\n",
      "\n",
      "Q: The pleura\n",
      "(A) have no sensory innervation. (B) are separated by a 2 mm space. (C) extend into the neck. (D) are composed of respiratory epithelium.\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. First, recall that the pleura refers to the thin layer of tissue that covers the lungs and lines the interior wall of the chest cavity. Now, let’s look at each option:\n",
      "Option (A): “The pleura have no sensory innervation.” This information is not correct. The pleura do have a sensory innervation.\n",
      "Option (B): “The pleura are separated by a 2 mm space.” This information is not correct. There is a very thin “potential” space between the layers of the pleura; however, it is typically filled with serous pleural fluid. \n",
      "Option (C): “The pleura extend into the neck.” This information is actuakky true. The cervical pleura, also known as the dome of the pleuradome of the pleura, lines the extendsiton of the pleural cavity into the neck.\n",
      "Option (D): “The pleura are composed of respiratory epithelium.” This information is not correct. The pleaura are composed of connective tissue (CT).\n",
      "Because (A), (B), and (D) are all incorrect, (D) is the only correct answer. The answer is (C).\n",
      "\n",
      "Q: What is the embryological origin of the hyoid bone?\n",
      "(A) The first pharyngeal arch (B) The first and second pharyngeal arches (C) The second pharyngeal arch (D) The second and third pharyngeal arches\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. The hyoid bone, which is also known as the hyooid, is a a small U-shaped bone located in the anterior neck. In its resting position, it lies between the ase of the mandible and the third cervical vertebrae. We know that the second and the third pharyngeal arches give rise to the horns of the hyoid bone; therefore, the embryological origin of the hyoid bone are the second and the third pharyngeal arches—this information is covered in the last option (D). Therefore, we conclude that (D) must be the correct answer. The answer is (D).\n"
     ]
    }
   ],
   "source": [
    "task = 'anatomy'\n",
    "print(mmlu_prompt[task])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ce1c6b79-3530-4efe-bfd6-eead27d09ff3",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading and preparing dataset mmlu/anatomy to /Users/yaofu/.cache/huggingface/datasets/lukaemon___mmlu/anatomy/1.0.0/134145dc2582b9a08b42d1f4b828f84a0066e9cc2e7dd8c1d83bee475746ecc3...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating test split:   0%|          | 0/134 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating validation split:   0%|          | 0/13 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split:   0%|          | 0/4 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset mmlu downloaded and prepared to /Users/yaofu/.cache/huggingface/datasets/lukaemon___mmlu/anatomy/1.0.0/134145dc2582b9a08b42d1f4b828f84a0066e9cc2e7dd8c1d83bee475746ecc3. Subsequent calls will reuse this data.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9c144c28d30c4383984572ac848d67e7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "task_data = load_dataset(\"lukaemon/mmlu\", task)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0957327f-9454-4010-9501-b7086fbba125",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "134"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(task_data['test'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5b18d727-13a7-45cd-a842-3462636d9c25",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': 'A \"dished face\" profile is often associated with',\n",
       " 'A': 'a protruding mandible due to reactivation of the condylar cartilage by acromegaly.',\n",
       " 'B': 'a recessive maxilla due to failure of elongation of the cranial base.',\n",
       " 'C': 'an enlarged frontal bone due to hydrocephaly.',\n",
       " 'D': 'defective development of the maxillary air sinus.',\n",
       " 'target': 'B'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task_data['test'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "bec64ac3-96b4-45cd-b79e-6c8ad6fae234",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "prompt_q = mmlu_prompt[task] + \"\\n\\n\" + task_data['test'][0]['input'] + '\\n'\n",
    "for letter in ['A', 'B', 'C', 'D']:\n",
    "    prompt_q += '(' + letter + ') ' + task_data['test'][0][letter] + ' '\n",
    "prompt_q += \"\\nA: Let's think step by step.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8847b4b5-99fe-47ab-941b-5bd02c43755e",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The following are multiple choice questions (with answers) about anatomy.\n",
      "\n",
      "Q: Which of the following is the body cavity that contains the pituitary gland?\n",
      "(A) Abdominal (B) Cranial (C) Pleural (D) Spinal\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. The pituitary gland is the major endocrine gland attached to the base of the brain, and it is contained in the Cranial cavity. The answer is (B).\n",
      "\n",
      "Q: Which of these branches of the trigeminal nerve contain somatic motor processes?\n",
      "(A) The supraorbital nerve (B) The infraorbital nerve (C) The mental nerve (D) None of the above\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. \n",
      "We know the following: (A) The supraorbital nerve (also known as the frontal nerve) is the largest branch of the ophthalmic nerve and branch of ophthalmic division of the trigeminal nerve. (B) The infraorbital nerve is a branch of the maxillary division of the trigeminal nerve. (C) The mental nerve is a branch of the mandibular division of the trigeminal nerve. Because all these nerves are purely sensory nerves and do not contain any somatic motor processes. Therefore, the answer should be none of the above, which is (D). The answer is (D).\n",
      "\n",
      "Q: In Angle's Class II Div 2 occlusion there is\n",
      "(A) excess overbite of the upper lateral incisors. (B) negative overjet of the upper central incisors. (C) excess overjet of the upper lateral incisors. (D) excess overjet of the upper central incisors.\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. This is a question related to anatomy and orthodontics. Excess overjet is associated with Class II occlusions; therefore, we can safely eliminate (B) from the list, as negative overjet is often associated with Class III occlusions. Now, we need to determine the location of the excess overjet, and that would be the upper (maxillary) lateral incisors. Only (C) has the correct information. The answer is (C).\n",
      "\n",
      "Q: The pleura\n",
      "(A) have no sensory innervation. (B) are separated by a 2 mm space. (C) extend into the neck. (D) are composed of respiratory epithelium.\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. First, recall that the pleura refers to the thin layer of tissue that covers the lungs and lines the interior wall of the chest cavity. Now, let’s look at each option:\n",
      "Option (A): “The pleura have no sensory innervation.” This information is not correct. The pleura do have a sensory innervation.\n",
      "Option (B): “The pleura are separated by a 2 mm space.” This information is not correct. There is a very thin “potential” space between the layers of the pleura; however, it is typically filled with serous pleural fluid. \n",
      "Option (C): “The pleura extend into the neck.” This information is actuakky true. The cervical pleura, also known as the dome of the pleuradome of the pleura, lines the extendsiton of the pleural cavity into the neck.\n",
      "Option (D): “The pleura are composed of respiratory epithelium.” This information is not correct. The pleaura are composed of connective tissue (CT).\n",
      "Because (A), (B), and (D) are all incorrect, (D) is the only correct answer. The answer is (C).\n",
      "\n",
      "Q: What is the embryological origin of the hyoid bone?\n",
      "(A) The first pharyngeal arch (B) The first and second pharyngeal arches (C) The second pharyngeal arch (D) The second and third pharyngeal arches\n",
      "A: Let's think step by step. We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. The hyoid bone, which is also known as the hyooid, is a a small U-shaped bone located in the anterior neck. In its resting position, it lies between the ase of the mandible and the third cervical vertebrae. We know that the second and the third pharyngeal arches give rise to the horns of the hyoid bone; therefore, the embryological origin of the hyoid bone are the second and the third pharyngeal arches—this information is covered in the last option (D). Therefore, we conclude that (D) must be the correct answer. The answer is (D).\n",
      "\n",
      "A \"dished face\" profile is often associated with\n",
      "(A) a protruding mandible due to reactivation of the condylar cartilage by acromegaly. (B) a recessive maxilla due to failure of elongation of the cranial base. (C) an enlarged frontal bone due to hydrocephaly. (D) defective development of the maxillary air sinus. \n",
      "A: Let's think step by step.\n"
     ]
    }
   ],
   "source": [
    "print(prompt_q)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c91e1e78-0e6d-4d82-9600-b6c76637fa80",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "response = openai.ChatCompletion.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"Follow the given examples and answer the question.\"},\n",
    "        {\"role\": \"user\", \"content\": prompt_q},\n",
    "    ],\n",
    "    temperature=0, \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b1e7b952-30b3-43b3-aa11-06595a7f592c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'We refer to Wikipedia articles on anatomy for help. Let’s solve this problem step by step. A \"dished face\" profile refers to a facial profile that appears concave or curved inward. Now, let’s look at each option:\\nOption (A): “A protruding mandible due to reactivation of the condylar cartilage by acromegaly.” This information is not correct. Acromegaly is a hormonal disorder that causes excessive growth hormone production, leading to enlargement of the bones of the face, hands, and feet. However, it typically results in a protruding jaw rather than a dished face profile.\\nOption (B): “A recessive maxilla due to failure of elongation of the cranial base.” This information is correct. A recessive maxilla can cause a dished face profile, as it can result in a retruded or sunken appearance of the midface.\\nOption (C): “An enlarged frontal bone due to hydrocephaly.” This information is not correct. Hydrocephaly is a condition in which there is an accumulation of cerebrospinal fluid in the brain, leading to enlargement of the head. However, it typically does not result in a dished face profile.\\nOption (D): “Defective development of the maxillary air sinus.” This information is not correct. Defective development of the maxillary air sinus can cause various sinus-related symptoms, but it typically does not result in a dished face profile.\\nBecause only (B) has the correct information, (B) is the correct answer. The answer is (B).'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response['choices'][0]['message']['content']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "867945a0-6527-409e-801e-4244eaf75e61",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def test_answer_mmlu(pred_str, ans_str):\n",
    "    pattern = 'the answer is ('\n",
    "    pred = pred_str.lower().split(pattern)\n",
    "    \n",
    "    if(len(pred) > 1):\n",
    "        # print(pred)\n",
    "        pred = pred[1][0]\n",
    "        gold = ans_str.split('A:\\n')[1][0].lower()\n",
    "        # print('debug 1, pred %s, gold %s' % (pred, gold))\n",
    "        return pred == gold\n",
    "    else: \n",
    "        pred = 'C'\n",
    "        gold = ans_str.split('A:\\n')[1][0].lower()\n",
    "        # print('debug 2, pred %s, gold %s' % (pred, gold))\n",
    "        return pred == gold\n",
    "\n",
    "def parse_pred_ans(filename):\n",
    "    with open(filename) as fd: lines = fd.readlines()\n",
    "    am, a = None, None\n",
    "    num_q, acc = 0, 0\n",
    "    current_mode = 'none'\n",
    "    questions = []\n",
    "    ans_pred = []\n",
    "    ans_gold = []\n",
    "    for l in lines:\n",
    "        if(l.startswith('Q: ')):\n",
    "            if(am is not None and a is not None):\n",
    "                questions.append(q)\n",
    "                ans_pred.append(am)\n",
    "                ans_gold.append(a)\n",
    "                # print(am)\n",
    "                # print(a)\n",
    "                if(test_answer_mmlu(am, a)):\n",
    "                    acc += 1\n",
    "            current_mode = 'q'\n",
    "            q = l\n",
    "            num_q += 1\n",
    "        elif(l.startswith('A_model:')):\n",
    "            current_mode = 'am'\n",
    "            am = l\n",
    "        elif(l.startswith('A:')):\n",
    "            current_mode = 'a'\n",
    "            a = l\n",
    "        else:\n",
    "            if(current_mode == 'q'): q += l\n",
    "            elif(current_mode == 'am'): am += l\n",
    "            elif(current_mode == 'a'): a += l\n",
    "            else:\n",
    "                raise ValueError(current_mode)\n",
    "                \n",
    "    questions.append(q)\n",
    "    ans_pred.append(am)\n",
    "    ans_gold.append(a)\n",
    "    # print(am)\n",
    "    # print(a)\n",
    "    if(test_answer_mmlu(am, a)):\n",
    "        acc += 1\n",
    "    print('num_q %d correct %d ratio %.4f' % (num_q, acc, float(acc / num_q)))\n",
    "    return questions, ans_pred, ans_gold\n",
    "\n",
    "def test_finished(ans_model):\n",
    "    if('answer is' in ans_model): return True\n",
    "    else: return False\n",
    "\n",
    "def extract_ans(ans_model):\n",
    "    ans_model = ans_model.split('\\n')\n",
    "    ans = []\n",
    "    residual = []\n",
    "    for li, al in enumerate(ans_model):\n",
    "        ans.append(al)\n",
    "        if('answer is' in al):\n",
    "            break\n",
    "    residual = list(ans_model[li + 1:])\n",
    "    ans = '\\n'.join(ans)\n",
    "    residual = '\\n'.join(residual)\n",
    "    return ans, residual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b398b88e-d8fd-47fc-be03-f4627f6719e1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 134/134 [09:06<00:00,  4.08s/it]\n"
     ]
    }
   ],
   "source": [
    "i = 0\n",
    "with open('outputs/test_gpt_3.5_turbo_%s.txt' % task, 'w') as fd:\n",
    "    for q_ in tqdm(task_data['test'], total=len(task_data['test'])):\n",
    "        q = q_['input'] + '\\n'\n",
    "        for letter in ['A', 'B', 'C', 'D']:\n",
    "            q += '(' + letter + ') ' + q_[letter] + ' '\n",
    "        q += \"\\nA: Let's think step by step.\"  \n",
    "            \n",
    "        prompt_q = mmlu_prompt[task] + \"\\n\\n\" + q\n",
    "\n",
    "        response = completion_with_backoff(\n",
    "              model=\"gpt-3.5-turbo\",\n",
    "              messages=[\n",
    "                    {\"role\": \"system\", \"content\": \"Follow the given examples and answer the question.\"},\n",
    "                    {\"role\": \"user\", \"content\": prompt_q},\n",
    "                ],\n",
    "            temperature=0\n",
    "            )\n",
    "        ans_model = response['choices'][0]['message']['content']\n",
    "        ans_, residual = extract_ans(ans_model)\n",
    "            \n",
    "        a = q_['target']\n",
    "        fd.write('Q: %s\\nA_model:\\n%s\\nA:\\n%s\\n\\n' % (q, ans_, a))\n",
    "        i += 1\n",
    "        # if(i == 2): break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "8ea5012b-444a-49d8-8752-51ea485d9beb",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "anatomy\n",
      "num_q 134 correct 79 ratio 0.5896\n"
     ]
    }
   ],
   "source": [
    "print(task)\n",
    "_, _, _ = parse_pred_ans('outputs/test_gpt_3.5_turbo_%s.txt' % task)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
