{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import json\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_json_file(file_path):\n",
    "    with open(file_path, 'r', encoding=\"utf-8\") as file:\n",
    "        json_data = json.load(file)\n",
    "    return json_data\n",
    "# ac = read_json_file('./alpaca_chat_data.json')\n",
    "mc = read_json_file('./medical_chat_data.json')\n",
    "qc = read_json_file('./quora_chat_data.json')\n",
    "sc = read_json_file('./stackoverflow_chat_data.json')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mc_ratio:0.29593544191098003\n",
      "qc_ratio:0.3438551736766665\n",
      "sc_ratio:0.36020938441235345\n"
     ]
    }
   ],
   "source": [
    "session_cnt = len(mc)+len(qc)+len(sc)\n",
    "# ac_ratio = len(ac)/session_cnt\n",
    "mc_ratio = len(mc)/session_cnt\n",
    "qc_ratio = len(qc)/session_cnt\n",
    "sc_ratio = len(sc)/session_cnt\n",
    "# print(f'ac_ratio:{ac_ratio}')\n",
    "print(f'mc_ratio:{mc_ratio}')\n",
    "print(f'qc_ratio:{qc_ratio}')\n",
    "print(f'sc_ratio:{sc_ratio}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2959\n",
      "3439\n",
      "3602\n"
     ]
    }
   ],
   "source": [
    "# num_ac_sample = round(10000*ac_ratio)+1\n",
    "num_mc_sample = round(10000*mc_ratio)\n",
    "num_qc_sample = round(10000*qc_ratio)\n",
    "num_sc_sample = round(10000*sc_ratio)\n",
    "# print(num_ac_sample)\n",
    "print(num_mc_sample)\n",
    "print(num_qc_sample) \n",
    "print(num_sc_sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert ((num_mc_sample+num_qc_sample+num_sc_sample) == 10000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "random_seed = 125\n",
    "random.seed(random_seed)\n",
    "# random.shuffle(ac)\n",
    "random.shuffle(mc)\n",
    "random.shuffle(qc)\n",
    "random.shuffle(sc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'topic': 'What is (are) Pachygyria ?',\n",
       " 'input': 'The conversation between human and AI assistant.\\n[|Human|] Can you tell me what Pachygyria is?\\n[|AI|] Pachygyria is a rare brain malformation where the folds in the cerebral cortex (the outer layer of the brain) are abnormally thick and reduced in number. This can cause developmental delays, seizures, and other neurological symptoms in affected individuals.\\n[|Human|] What causes Pachygyria?\\n[|AI|] Pachygyria can be caused by genetic mutations or other factors that disrupt the normal development of the brain. It can also be associated with other genetic syndromes.\\n[|Human|] Is there a cure for Pachygyria?\\n[|AI|] Unfortunately, there is no cure for Pachygyria at this time. Treatment focuses on managing symptoms and improving quality of life for affected individuals.\\n[|Human|] What are some of the symptoms of Pachygyria?\\n[|AI|] Some of the symptoms of Pachygyria include developmental delays, seizures, muscle weakness or stiffness, and difficulty with vision or hearing.\\n[|Human|] Is Pachygyria a common condition?\\n[|AI|] No, Pachygyria is quite rare. It is estimated to occur in less than 1 in 25,000 births.\\n[|Human|] '}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check\n",
    "mc[0]\n",
    "# {'topic': 'What is (are) Pachygyria ?',\n",
    "# 'input': 'The conversation between human and AI assistant.\\n[|Human|] Can you"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ac_10k = ac[:num_ac_sample]\n",
    "mc_10k = mc[:num_mc_sample]\n",
    "qc_10k = qc[:num_qc_sample]\n",
    "sc_10k = sc[:num_sc_sample]\n",
    "\n",
    "bz_10k = []\n",
    "# bz_10k.extend(ac_10k)\n",
    "bz_10k.extend(mc_10k)\n",
    "bz_10k.extend(qc_10k)\n",
    "bz_10k.extend(sc_10k)\n",
    "assert len(bz_10k) == 10000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'topic': 'What are the symptoms of Subaortic stenosis short stature syndrome ?', 'input': \"The conversation between human and AI assistant.\\n[|Human|] Can you tell me about the symptoms of Subaortic stenosis short stature syndrome?\\n[|AI|] Sure, Subaortic stenosis short stature syndrome is a rare disorder. The symptoms can include short stature, narrowing of the aortic valve, fatigue, exercise intolerance, among others.\\n[|Human|] Is this disease genetic?\\n[|AI|] Yes, it is usually inherited as an autosomal dominant trait, which means if one parent has the condition, there is a 50% chance that their child will inherit the gene and the condition.\\n[|Human|] How is it diagnosed?\\n[|AI|] A doctor may perform a physical exam and listen to the patient's heart for any abnormalities. An echocardiogram may also be done to determine the extent of the heart condition.\\n[|Human|] Is there any treatment for this?\\n[|AI|] Treatment options for Subaortic stenosis short stature syndrome may include medications to help control blood pressure or prevent clots, surgical procedures to repair or replace a valve, or other types of heart procedures to improve blood flow.\\n[|Human|] \"}\n"
     ]
    }
   ],
   "source": [
    "random_seed = 125\n",
    "random.seed(random_seed)\n",
    "random.shuffle(bz_10k)\n",
    "print(bz_10k[0])\n",
    "# {'topic': 'What are the symptoms of Subaortic stenosis short stature syndrome ?'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'topic': 'What would happen to the amount submitted by me through challent after valid date of challan submission?',\n",
       " 'input': 'The conversation between human and AI assistant.\\n[|Human|] I have a question about the challan submission. What would happen to the amount submitted by me through the challan after the valid date of challan submission?\\n[|AI|] If you have submitted the challan after the valid date, then the amount will not be accepted by the concerned authorities. It is always advisable to submit the challan before the due date to avoid any inconvenience.\\n[|Human|] Is there any way to get the amount back or will it be forfeited?\\n[|AI|] It depends on the concerned authorities. It is always better to contact the authority and ask if there is any way to get the amount back. However, it is highly unlikely that the amount will be refunded.\\n[|Human|] Can I submit the challan again if I missed the valid date?\\n[|AI|] No, you cannot submit the challan again if you missed the valid date. You will have to wait for the next due date or contact the concerned authority for further clarification.\\n[|Human|] '}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bz_10k[5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Remove"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "for session_dict in bz_10k:\n",
    "    session_str = session_dict['input']\n",
    "    session_str = re.sub(r'\\[\\|Human\\|\\].*$', '', session_str) \n",
    "    session_list = re.split(r'(\\[\\|Human\\|\\]|\\[\\|AI\\|\\] )', session_str)\n",
    "    session_list = [i.strip() for i in session_list if i.strip()]\n",
    "    session_list = session_list[1:]\n",
    "    removed_list = []\n",
    "    for i in session_list:\n",
    "        if i == '[|Human|]' or i == '[|AI|]':\n",
    "            continue\n",
    "        removed_list.append(i)\n",
    "    \n",
    "    conv_list = []\n",
    "    for i, utter in enumerate(removed_list):\n",
    "        if i % 2 == 0: \n",
    "            conv_list.append({'from': 'human', 'value': utter})\n",
    "        else: \n",
    "            conv_list.append({'from': 'gpt', 'value': utter})\n",
    "    session_dict['conversations'] = conv_list\n",
    "\n",
    "    del session_dict['input']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'topic': 'What are the symptoms of Subaortic stenosis short stature syndrome ?',\n",
       " 'conversations': [{'from': 'human',\n",
       "   'value': 'Can you tell me about the symptoms of Subaortic stenosis short stature syndrome?'},\n",
       "  {'from': 'gpt',\n",
       "   'value': 'Sure, Subaortic stenosis short stature syndrome is a rare disorder. The symptoms can include short stature, narrowing of the aortic valve, fatigue, exercise intolerance, among others.'},\n",
       "  {'from': 'human', 'value': 'Is this disease genetic?'},\n",
       "  {'from': 'gpt',\n",
       "   'value': 'Yes, it is usually inherited as an autosomal dominant trait, which means if one parent has the condition, there is a 50% chance that their child will inherit the gene and the condition.'},\n",
       "  {'from': 'human', 'value': 'How is it diagnosed?'},\n",
       "  {'from': 'gpt',\n",
       "   'value': \"A doctor may perform a physical exam and listen to the patient's heart for any abnormalities. An echocardiogram may also be done to determine the extent of the heart condition.\"},\n",
       "  {'from': 'human', 'value': 'Is there any treatment for this?'},\n",
       "  {'from': 'gpt',\n",
       "   'value': 'Treatment options for Subaortic stenosis short stature syndrome may include medications to help control blood pressure or prevent clots, surgical procedures to repair or replace a valve, or other types of heart procedures to improve blood flow.'}]}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check\n",
    "bz_10k[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('baize_10k_for_train.json', 'w', encoding=\"utf-8\" ) as file:\n",
    "    json.dump(bz_10k, file, indent=2, ensure_ascii=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "inquirygpt-cli",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
