{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open('path/to/scienceqa/train.json','r') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "with open('path/to/scienceqa/test.json','r') as f:\n",
    "    data_test = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from PIL import Image as PILImage\n",
    "\n",
    "vqa = []\n",
    "\n",
    "# Keep only samples that have an image; text-only questions are skipped.\n",
    "for item in data:\n",
    "    if 'image' in item:\n",
    "        # The first turn is the question, the last turn is the answer.\n",
    "        problem = item['conversations'][0]['value']\n",
    "        images = [PILImage.open(item['image'].replace('ScienceQA/images/train','path/to/your/images/train'))]\n",
    "        answer = item['conversations'][-1]['value']\n",
    "        vqa.append({\n",
    "            'images': images,\n",
    "            'problem': problem,\n",
    "            'answer': answer\n",
    "        })\n",
    "\n",
    "vqa_test = []\n",
    "\n",
    "for item in data_test:\n",
    "    if 'image' in item:\n",
    "        # Test items keep the prompt in 'text': prepend the image placeholder and drop the 'Context: ' prefix.\n",
    "        problem = '<image>\\n'+item['text'].replace('Context: ','')\n",
    "        images = [PILImage.open(item['image'].replace('ScienceQA/images/test','path/to/your/images/test'))]\n",
    "        answer = item['answer']\n",
    "        vqa_test.append({\n",
    "            'images': images,\n",
    "            'problem': problem,\n",
    "            'answer': answer\n",
    "        })\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=750x429>],\n",
       "  'problem': \"<image>\\nWhich of these states is farthest north?\\nA. West Virginia\\nB. Louisiana\\nC. Arizona\\nD. Oklahoma\\nAnswer with the option's letter from the given choices directly.\",\n",
       "  'answer': 'A'},\n",
       " {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=302x232>],\n",
       "  'problem': \"<image>\\nContext: The passage below describes an experiment. Read the passage and then follow the instructions below.\\n\\nTom placed a ping pong ball in a catapult, pulled the catapult's arm back to a 45° angle, and launched the ball. Then, Tom launched another ping pong ball, this time pulling the catapult's arm back to a 30° angle. With each launch, his friend Justin measured the distance between the catapult and the place where the ball hit the ground. Tom and Justin repeated the launches with ping pong balls in four more identical catapults. They compared the distances the balls traveled when launched from a 45° angle to the distances the balls traveled when launched from a 30° angle.\\nFigure: a catapult for launching ping pong balls.\\nIdentify the question that Tom and Justin's experiment can best answer.\\nA. Do ping pong balls stop rolling along the ground sooner after being launched from a 30° angle or a 45° angle?\\nB. Do ping pong balls travel farther when launched from a 30° angle compared to a 45° angle?\\nAnswer with the option's letter from the given choices directly.\",\n",
       "  'answer': 'B'},\n",
       " {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=302x232>],\n",
       "  'problem': \"<image>\\nContext: The passage below describes an experiment. Read the passage and then follow the instructions below.\\n\\nKathleen applied a thin layer of wax to the underside of her snowboard and rode the board straight down a hill. Then, she removed the wax and rode the snowboard straight down the hill again. She repeated the rides four more times, alternating whether she rode with a thin layer of wax on the board or not. Her friend Bryant timed each ride. Kathleen and Bryant calculated the average time it took to slide straight down the hill on the snowboard with wax compared to the average time on the snowboard without wax.\\nFigure: snowboarding down a hill.\\nIdentify the question that Kathleen and Bryant's experiment can best answer.\\nA. Does Kathleen's snowboard slide down a hill in less time when it has a layer of wax or when it does not have a layer of wax?\\nB. Does Kathleen's snowboard slide down a hill in less time when it has a thin layer of wax or a thick layer of wax?\\nAnswer with the option's letter from the given choices directly.\",\n",
       "  'answer': 'A'},\n",
       " {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=251x239>],\n",
       "  'problem': \"<image>\\nContext: This passage describes the myotonia congenita trait in goats:\\nMyotonia congenita is a condition that causes temporary muscle stiffness. When goats with myotonia congenita attempt to run from a resting position, their leg muscles often stiffen, causing them to fall over. Because of this behavior, these goats are referred to as fainting goats. Myotonia congenita is also found in other mammals, including horses, cats, and humans.\\nIn a group of goats, some individuals have myotonia congenita and others do not. In this group, the gene for the myotonia congenita trait has two alleles. The allele for having myotonia congenita (M) is dominant over the allele for not having myotonia congenita (m).\\nThis Punnett square shows a cross between two goats.\\nWhat is the probability that a goat produced by this cross will be homozygous dominant for the myotonia congenita gene?\\nA. 1/4\\nB. 0/4\\nC. 4/4\\nD. 2/4\\nE. 3/4\\nAnswer with the option's letter from the given choices directly.\",\n",
       "  'answer': 'A'},\n",
       " {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=563x405>],\n",
       "  'problem': \"<image>\\nContext: The diagrams below show two pure samples of gas in identical closed, rigid containers. Each colored ball represents one gas particle. Both samples have the same number of particles.\\nCompare the average kinetic energies of the particles in each sample. Which sample has the higher temperature?\\nA. neither; the samples have the same temperature\\nB. sample A\\nC. sample B\\nAnswer with the option's letter from the given choices directly.\",\n",
       "  'answer': 'C'}]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vqa[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from datasets import Dataset, Features, Image, Value\n",
    "from PIL import Image as PILImage\n",
    "import os\n",
    "\n",
    "\n",
    "def process_data(vqa_data):\n",
    "    images = []\n",
    "    problems = []\n",
    "    answers = []\n",
    "\n",
    "    for example in vqa_data:\n",
    "        problems.append(example['problem'])\n",
    "        answers.append(example['answer'])\n",
    "\n",
    "        if 'images' in example:\n",
    "            images.append(example['images'])\n",
    "        else:\n",
    "            images.append(None)  # no image: store None (or skip the example)\n",
    "\n",
    "    return {\n",
    "        'images': images,\n",
    "        'problem': problems,\n",
    "        'answer': answers\n",
    "    }\n",
    "\n",
    "# Process the data\n",
    "processed_data = process_data(vqa)\n",
    "processed_data_test = process_data(vqa_test)\n",
    "# Define the dataset features\n",
    "features = Features({\n",
    "    \"images\": [Image(mode=None, decode=True)],  # a list of images; each element is a PIL image object\n",
    "    \"problem\": Value(dtype=\"string\"),          # question field, string type\n",
    "    \"answer\": Value(dtype=\"string\")            # answer field, string type\n",
    "})\n",
    "\n",
    "# Create the datasets\n",
    "train_set = Dataset.from_dict(processed_data, features=features)\n",
    "test_set = Dataset.from_dict(processed_data_test, features=features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=251x239>],\n",
       " 'problem': \"<image>\\nContext: This passage describes the myotonia congenita trait in goats:\\nMyotonia congenita is a condition that causes temporary muscle stiffness. When goats with myotonia congenita attempt to run from a resting position, their leg muscles often stiffen, causing them to fall over. Because of this behavior, these goats are referred to as fainting goats. Myotonia congenita is also found in other mammals, including horses, cats, and humans.\\nIn a group of goats, some individuals have myotonia congenita and others do not. In this group, the gene for the myotonia congenita trait has two alleles. The allele for having myotonia congenita (M) is dominant over the allele for not having myotonia congenita (m).\\nThis Punnett square shows a cross between two goats.\\nWhat is the probability that a goat produced by this cross will be homozygous dominant for the myotonia congenita gene?\\nA. 1/4\\nB. 0/4\\nC. 4/4\\nD. 2/4\\nE. 3/4\\nAnswer with the option's letter from the given choices directly.\",\n",
       " 'answer': 'A'}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_set[3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import DatasetDict\n",
    "dataset1 = DatasetDict({\n",
    "    \"train\": train_set,\n",
    "    \"test\": test_set\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=750x429>],\n",
       " 'problem': \"<image>\\nWhich of these states is farthest north?\\nA. West Virginia\\nB. Louisiana\\nC. Arizona\\nD. Oklahoma\\nAnswer with the option's letter from the given choices directly.\",\n",
       " 'answer': 'A'}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset1['train'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Creating parquet from Arrow format: 100%|██████████| 63/63 [00:00<00:00, 3036.03ba/s]\n",
      "Creating parquet from Arrow format: 100%|██████████| 21/21 [00:00<00:00, 2978.51ba/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Datasets saved in .parquet format!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "output_dir = \"/home/esg8sdce/esg8sdceuser01/project/mllm_cl/cl_datasets/ScienceQA\"\n",
    "os.makedirs(output_dir, exist_ok=True)\n",
    "\n",
    "# Save each split in Parquet format\n",
    "for split, dataset in dataset1.items():\n",
    "    file_name = f\"{split}-00000-of-00001.parquet\"\n",
    "    output_path = os.path.join(output_dir, file_name)\n",
    "\n",
    "    # Write the split to a Parquet file\n",
    "    dataset.to_parquet(output_path)\n",
    "\n",
    "print(\"Datasets saved in .parquet format!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Generating train split: 8639 examples [00:00, 129117.36 examples/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['images', 'problem', 'answer'],\n",
       "    num_rows: 8639\n",
       "})"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "with open('path/to/scienceqa/train.json','r') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "with open('path/to/scienceqa/test.json','r') as f:\n",
    "    data_test = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "from PIL import Image as PILImage\n",
    "\n",
    "vqa = []\n",
    "\n",
    "for item in data:\n",
    "    if 'image' in item:\n",
    "        problem = item['conversations'][0]['value']\n",
    "        # Store the relative image path rather than a loaded PIL image.\n",
    "        images = [item['image'].replace('ScienceQA/images/train','ScienceQA/train')]\n",
    "        answer = item['conversations'][-1]['value']\n",
    "        messages = [{\n",
    "            'role': 'user',\n",
    "            'content': problem \n",
    "        }, {\n",
    "            'role': 'assistant',\n",
    "            'content': answer\n",
    "        }]\n",
    "        vqa.append({\n",
    "            'messages': messages,\n",
    "            'images': images,\n",
    "        })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "vqa_test = []\n",
    "\n",
    "for item in data_test:\n",
    "    if 'image' in item:\n",
    "        # Prepend the image placeholder and drop the 'Context: ' prefix.\n",
    "        problem = '<image>\\n'+item['text'].replace('Context: ','')\n",
    "        images = [item['image'].replace('ScienceQA/images/test','ScienceQA/test')]\n",
    "        answer = item['answer']\n",
    "        messages = [{\n",
    "            'role': 'user',\n",
    "            'content': problem \n",
    "        }, {\n",
    "            'role': 'assistant',\n",
    "            'content': answer\n",
    "        }]\n",
    "        vqa_test.append({\n",
    "            'messages': messages,\n",
    "            'images': images,\n",
    "        })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nWhich of these states is farthest north?\\nA. West Virginia\\nB. Louisiana\\nC. Arizona\\nD. Oklahoma\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'A'}],\n",
       "  'images': ['ScienceQA/train/1/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nContext: The passage below describes an experiment. Read the passage and then follow the instructions below.\\n\\nTom placed a ping pong ball in a catapult, pulled the catapult's arm back to a 45° angle, and launched the ball. Then, Tom launched another ping pong ball, this time pulling the catapult's arm back to a 30° angle. With each launch, his friend Justin measured the distance between the catapult and the place where the ball hit the ground. Tom and Justin repeated the launches with ping pong balls in four more identical catapults. They compared the distances the balls traveled when launched from a 45° angle to the distances the balls traveled when launched from a 30° angle.\\nFigure: a catapult for launching ping pong balls.\\nIdentify the question that Tom and Justin's experiment can best answer.\\nA. Do ping pong balls stop rolling along the ground sooner after being launched from a 30° angle or a 45° angle?\\nB. Do ping pong balls travel farther when launched from a 30° angle compared to a 45° angle?\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'B'}],\n",
       "  'images': ['ScienceQA/train/2/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nContext: The passage below describes an experiment. Read the passage and then follow the instructions below.\\n\\nKathleen applied a thin layer of wax to the underside of her snowboard and rode the board straight down a hill. Then, she removed the wax and rode the snowboard straight down the hill again. She repeated the rides four more times, alternating whether she rode with a thin layer of wax on the board or not. Her friend Bryant timed each ride. Kathleen and Bryant calculated the average time it took to slide straight down the hill on the snowboard with wax compared to the average time on the snowboard without wax.\\nFigure: snowboarding down a hill.\\nIdentify the question that Kathleen and Bryant's experiment can best answer.\\nA. Does Kathleen's snowboard slide down a hill in less time when it has a layer of wax or when it does not have a layer of wax?\\nB. Does Kathleen's snowboard slide down a hill in less time when it has a thin layer of wax or a thick layer of wax?\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'A'}],\n",
       "  'images': ['ScienceQA/train/3/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nContext: This passage describes the myotonia congenita trait in goats:\\nMyotonia congenita is a condition that causes temporary muscle stiffness. When goats with myotonia congenita attempt to run from a resting position, their leg muscles often stiffen, causing them to fall over. Because of this behavior, these goats are referred to as fainting goats. Myotonia congenita is also found in other mammals, including horses, cats, and humans.\\nIn a group of goats, some individuals have myotonia congenita and others do not. In this group, the gene for the myotonia congenita trait has two alleles. The allele for having myotonia congenita (M) is dominant over the allele for not having myotonia congenita (m).\\nThis Punnett square shows a cross between two goats.\\nWhat is the probability that a goat produced by this cross will be homozygous dominant for the myotonia congenita gene?\\nA. 1/4\\nB. 0/4\\nC. 4/4\\nD. 2/4\\nE. 3/4\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'A'}],\n",
       "  'images': ['ScienceQA/train/17/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nContext: The diagrams below show two pure samples of gas in identical closed, rigid containers. Each colored ball represents one gas particle. Both samples have the same number of particles.\\nCompare the average kinetic energies of the particles in each sample. Which sample has the higher temperature?\\nA. neither; the samples have the same temperature\\nB. sample A\\nC. sample B\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'C'}],\n",
       "  'images': ['ScienceQA/train/19/image.png']}]"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vqa[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nPeople can use the engineering-design process to develop solutions to problems. One step in the process is testing if a potential solution meets the requirements of the design.\\nThe passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.\\n\\nGordon was an aerospace engineer who was developing a parachute for a spacecraft that would land on Mars. He needed to add a vent at the center of the parachute so the spacecraft would land smoothly. However, the spacecraft would have to travel at a high speed before landing. If the vent was too big or too small, the parachute might swing wildly at this speed. The movement could damage the spacecraft.\\nSo, to help decide how big the vent should be, Gordon put a parachute with a 1 m vent in a wind tunnel. The wind tunnel made it seem like the parachute was moving at 200 km per hour. He observed the parachute to see how much it swung.\\nFigure: a spacecraft's parachute in a wind tunnel.\\nWhich of the following could Gordon's test show?\\nA. if the spacecraft was damaged when using a parachute with a 1 m vent going 200 km per hour\\nB. how steady a parachute with a 1 m vent was at 200 km per hour\\nC. whether a parachute with a 1 m vent would swing too much at 400 km per hour\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'B'}],\n",
       "  'images': ['ScienceQA/test/5/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nWhat is the name of the colony shown?\\nA. Maryland\\nB. New Hampshire\\nC. Rhode Island\\nD. Vermont\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'B'}],\n",
       "  'images': ['ScienceQA/test/11/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nBelow is a food web from a tundra ecosystem in Nunavut, a territory in Northern Canada.\\nA food web models how the matter eaten by organisms moves through an ecosystem. The arrows in a food web represent how matter moves between organisms in an ecosystem.\\nWhich of these organisms contains matter that was once part of the lichen?\\nA. bilberry\\nB. mushroom\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'B'}],\n",
       "  'images': ['ScienceQA/test/23/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nThis passage describes the fleece type trait in sheep:\\nThe fleece, or outer coat, of a sheep is often cut off and used to make yarn for fabrics and other textiles. Woolly fleeces, which have shorter hairs, are usually used for clothing and blankets. Hairy fleeces, which have longer hairs, are usually used for carpets.\\nIn a group of sheep, some individuals have a hairy fleece and others have a woolly fleece. In this group, the gene for the fleece type trait has two alleles. The allele for a hairy fleece (F) is dominant over the allele for a woolly fleece (f).\\nThis Punnett square shows a cross between two sheep.\\nWhat is the expected ratio of offspring with a woolly fleece to offspring with a hairy fleece? Choose the most likely ratio.\\nA. 0:4\\nB. 4:0\\nC. 2:2\\nD. 1:3\\nE. 3:1\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'B'}],\n",
       "  'images': ['ScienceQA/test/42/image.png']},\n",
       " {'messages': [{'role': 'user',\n",
       "    'content': \"<image>\\nSelect the best answer.\\nWhich property do these three objects have in common?\\nA. shiny\\nB. slippery\\nC. opaque\\nAnswer with the option's letter from the given choices directly.\"},\n",
       "   {'role': 'assistant', 'content': 'C'}],\n",
       "  'images': ['ScienceQA/test/46/image.png']}]"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vqa_test[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6218\n",
      "2017\n"
     ]
    }
   ],
   "source": [
    "print(len(vqa))\n",
    "print(len(vqa_test))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_factory",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
