{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "75be4b0b-08c6-47b0-bec0-51ec4d608c57",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import pandas as pd\n",
    "import wikipediaapi\n",
    "from entities import NERVES, SCIENTIFIC_LAWS, CHEMICAL_ELEMENTS, MOUNTAINS, RIVER_SYSTEMS, NATIONAL_PARKS\n",
    "\n",
    "pd.set_option('display.max_colwidth', 200)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b26ad49e-772f-46b3-9fc4-a7f154ec4725",
   "metadata": {
    "tags": []
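   },
   "outputs": [],
   "source": [
    "def get_wiki_texts_from_list(entities: List[str]) -> List[str]:\n",
    "    \"\"\"Return the plain text of the English Wikipedia page for each entity, in input order.\"\"\"\n",
    "    # One client is reused for every request; the user agent identifies this project to Wikipedia.\n",
    "    wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyProjectName (merlin@example.com)', language='en')\n",
    "    texts = []\n",
    "    for name in entities:\n",
    "        # Accessing .text fetches the page content for this title.\n",
    "        text = wiki_wiki.page(name).text\n",
    "        texts.append(text)\n",
    "    return texts"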
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e7768a2e-8bf8-408e-8a13-be8c4eccae8a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>entity</th>\n",
       "      <th>wikipedia_text</th>\n",
       "      <th>length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Nasopalatine nerve</td>\n",
       "      <td>The nasopalatine nerve (also Scarpa's nerve or long sphenopalatine nerve) is a nerve of the head. It is a sensory branch of the maxillary nerve (CN V2) that passes through the pterygopalatine gang...</td>\n",
       "      <td>2239</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Deep branch of the radial nerve</td>\n",
       "      <td>The radial nerve divides into a superficial (sensory) and deep (motor) branch at the cubital fossa. The deep branch of the radial nerve winds to the back of the forearm around the lateral side of ...</td>\n",
       "      <td>2260</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Lesser occipital nerve</td>\n",
       "      <td>The lesser occipital nerve (or small occipital nerve) is a cutaneous spinal nerve of the cervical plexus. It arises from second cervical (spinal) nerve (C2) (along with the greater occipital nerve...</td>\n",
       "      <td>2270</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Lateral cutaneous nerve of forearm</td>\n",
       "      <td>The lateral  cutaneous nerve of forearm (or lateral antebrachial cutaneous nerve) is a sensory nerve representing the continuation of the musculocutaneous nerve beyond the lateral edge of the tend...</td>\n",
       "      <td>2278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Obturator nerve</td>\n",
       "      <td>The obturator nerve in human anatomy arises from the ventral divisions of the second, third, and fourth lumbar nerves in the lumbar plexus; the branch from the third is the largest, while that fro...</td>\n",
       "      <td>2278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Manaslu</td>\n",
       "      <td>Manaslu (; Nepali: मनास्लु, also known as Kutang) is the eighth-highest mountain in the world at 8,163 metres (26,781 ft) above sea level. It is in the Mansiri Himal, part of the Nepalese Himalaya...</td>\n",
       "      <td>25526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Nanga Parbat</td>\n",
       "      <td>Nanga Parbat, known locally as Diamer, is the ninth-highest mountain on Earth with its summit at 8,126 m (26,660 ft) above sea level. Lying immediately southeast of the northernmost bend of the In...</td>\n",
       "      <td>26433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Kangchenjunga</td>\n",
       "      <td>Kangchenjunga is the third-highest mountain in the world. Its summit lies at 8,586 m (28,169 ft) in a section of the Himalayas, the Kangchenjunga Himal, which is bounded in the west by the Tamur R...</td>\n",
       "      <td>27016</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>mountain</td>\n",
       "      <td>K2</td>\n",
       "      <td>K2, also known as Mount Godwin-Austen, at 8,611 metres (28,251 ft) above sea level, is the second-highest mountain on Earth, after Mount Everest at 8,849 metres (29,032 ft). It lies in the Karakor...</td>\n",
       "      <td>31979</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Mount Everest</td>\n",
       "      <td>Mount Everest (known locally as Sagarmāthā in Nepal and Qomolangma in Tibet) is Earth's highest mountain above sea level. It lies in the Mahalangur Himal sub-range of the Himalayas and marks part ...</td>\n",
       "      <td>86978</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     category                              entity  \\\n",
       "0      nerves                  Nasopalatine nerve   \n",
       "1      nerves     Deep branch of the radial nerve   \n",
       "2      nerves              Lesser occipital nerve   \n",
       "3      nerves  Lateral cutaneous nerve of forearm   \n",
       "4      nerves                     Obturator nerve   \n",
       "..        ...                                 ...   \n",
       "395  mountain                             Manaslu   \n",
       "396  mountain                        Nanga Parbat   \n",
       "397  mountain                       Kangchenjunga   \n",
       "398  mountain                                  K2   \n",
       "399  mountain                       Mount Everest   \n",
       "\n",
       "                                                                                                                                                                                              wikipedia_text  \\\n",
       "0    The nasopalatine nerve (also Scarpa's nerve or long sphenopalatine nerve) is a nerve of the head. It is a sensory branch of the maxillary nerve (CN V2) that passes through the pterygopalatine gang...   \n",
       "1    The radial nerve divides into a superficial (sensory) and deep (motor) branch at the cubital fossa. The deep branch of the radial nerve winds to the back of the forearm around the lateral side of ...   \n",
       "2    The lesser occipital nerve (or small occipital nerve) is a cutaneous spinal nerve of the cervical plexus. It arises from second cervical (spinal) nerve (C2) (along with the greater occipital nerve...   \n",
       "3    The lateral  cutaneous nerve of forearm (or lateral antebrachial cutaneous nerve) is a sensory nerve representing the continuation of the musculocutaneous nerve beyond the lateral edge of the tend...   \n",
       "4    The obturator nerve in human anatomy arises from the ventral divisions of the second, third, and fourth lumbar nerves in the lumbar plexus; the branch from the third is the largest, while that fro...   \n",
       "..                                                                                                                                                                                                       ...   \n",
       "395  Manaslu (; Nepali: मनास्लु, also known as Kutang) is the eighth-highest mountain in the world at 8,163 metres (26,781 ft) above sea level. It is in the Mansiri Himal, part of the Nepalese Himalaya...   \n",
       "396  Nanga Parbat, known locally as Diamer, is the ninth-highest mountain on Earth with its summit at 8,126 m (26,660 ft) above sea level. Lying immediately southeast of the northernmost bend of the In...   \n",
       "397  Kangchenjunga is the third-highest mountain in the world. Its summit lies at 8,586 m (28,169 ft) in a section of the Himalayas, the Kangchenjunga Himal, which is bounded in the west by the Tamur R...   \n",
       "398  K2, also known as Mount Godwin-Austen, at 8,611 metres (28,251 ft) above sea level, is the second-highest mountain on Earth, after Mount Everest at 8,849 metres (29,032 ft). It lies in the Karakor...   \n",
       "399  Mount Everest (known locally as Sagarmāthā in Nepal and Qomolangma in Tibet) is Earth's highest mountain above sea level. It lies in the Mahalangur Himal sub-range of the Himalayas and marks part ...   \n",
       "\n",
       "     length  \n",
       "0      2239  \n",
       "1      2260  \n",
       "2      2270  \n",
       "3      2278  \n",
       "4      2278  \n",
       "..      ...  \n",
       "395   25526  \n",
       "396   26433  \n",
       "397   27016  \n",
       "398   31979  \n",
       "399   86978  \n",
       "\n",
       "[400 rows x 4 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# Map each category label to its list of Wikipedia page titles.\n",
    "categories = {\"nerves\": NERVES, \"scientific_law\": SCIENTIFIC_LAWS, \"chemical_element\": CHEMICAL_ELEMENTS, \"mountain\": MOUNTAINS}\n",
    "frames = []\n",
    "for category_name, entities in categories.items():\n",
    "    texts = get_wiki_texts_from_list(entities)\n",
    "    tmp_df = pd.DataFrame({\"category\": [category_name] * len(entities), \"entity\": entities, \"wikipedia_text\": texts, \"length\": [len(t) for t in texts]})\n",
    "    # Keep the 100 longest articles per category, ordered by ascending length.\n",
    "    frames.append(tmp_df.sort_values(by=[\"length\"]).tail(100))\n",
    "# Concatenate once at the end instead of growing an empty DataFrame inside the loop.\n",
    "df = pd.concat(frames, ignore_index=True)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "35a966bf-af38-47e0-aee6-80e1f8796e3a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>entity</th>\n",
       "      <th>wikipedia_text</th>\n",
       "      <th>length</th>\n",
       "      <th>prompts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Nasopalatine nerve</td>\n",
       "      <td>The nasopalatine nerve (also Scarpa's nerve or long sphenopalatine nerve) is a nerve of the head. It is a sensory branch of the maxillary nerve (CN V2) that passes through the pterygopalatine gang...</td>\n",
       "      <td>2239</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the Nasopalatine nerve</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Deep branch of the radial nerve</td>\n",
       "      <td>The radial nerve divides into a superficial (sensory) and deep (motor) branch at the cubital fossa. The deep branch of the radial nerve winds to the back of the forearm around the lateral side of ...</td>\n",
       "      <td>2260</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the Deep branch of the radial nerve</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Lesser occipital nerve</td>\n",
       "      <td>The lesser occipital nerve (or small occipital nerve) is a cutaneous spinal nerve of the cervical plexus. It arises from second cervical (spinal) nerve (C2) (along with the greater occipital nerve...</td>\n",
       "      <td>2270</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the Lesser occipital nerve</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Lateral cutaneous nerve of forearm</td>\n",
       "      <td>The lateral  cutaneous nerve of forearm (or lateral antebrachial cutaneous nerve) is a sensory nerve representing the continuation of the musculocutaneous nerve beyond the lateral edge of the tend...</td>\n",
       "      <td>2278</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the Lateral cutaneous nerve of forearm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>nerves</td>\n",
       "      <td>Obturator nerve</td>\n",
       "      <td>The obturator nerve in human anatomy arises from the ventral divisions of the second, third, and fourth lumbar nerves in the lumbar plexus; the branch from the third is the largest, while that fro...</td>\n",
       "      <td>2278</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the Obturator nerve</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Manaslu</td>\n",
       "      <td>Manaslu (; Nepali: मनास्लु, also known as Kutang) is the eighth-highest mountain in the world at 8,163 metres (26,781 ft) above sea level. It is in the Mansiri Himal, part of the Nepalese Himalaya...</td>\n",
       "      <td>25526</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the mountain Manaslu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Nanga Parbat</td>\n",
       "      <td>Nanga Parbat, known locally as Diamer, is the ninth-highest mountain on Earth with its summit at 8,126 m (26,660 ft) above sea level. Lying immediately southeast of the northernmost bend of the In...</td>\n",
       "      <td>26433</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the mountain Nanga Parbat</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Kangchenjunga</td>\n",
       "      <td>Kangchenjunga is the third-highest mountain in the world. Its summit lies at 8,586 m (28,169 ft) in a section of the Himalayas, the Kangchenjunga Himal, which is bounded in the west by the Tamur R...</td>\n",
       "      <td>27016</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the mountain Kangchenjunga</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>mountain</td>\n",
       "      <td>K2</td>\n",
       "      <td>K2, also known as Mount Godwin-Austen, at 8,611 metres (28,251 ft) above sea level, is the second-highest mountain on Earth, after Mount Everest at 8,849 metres (29,032 ft). It lies in the Karakor...</td>\n",
       "      <td>31979</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the mountain K2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>399</th>\n",
       "      <td>mountain</td>\n",
       "      <td>Mount Everest</td>\n",
       "      <td>Mount Everest (known locally as Sagarmāthā in Nepal and Qomolangma in Tibet) is Earth's highest mountain above sea level. It lies in the Mahalangur Himal sub-range of the Himalayas and marks part ...</td>\n",
       "      <td>86978</td>\n",
       "      <td>Provide me with a paragraph detailing some facts related to the mountain Mount Everest</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>400 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     category                              entity  \\\n",
       "0      nerves                  Nasopalatine nerve   \n",
       "1      nerves     Deep branch of the radial nerve   \n",
       "2      nerves              Lesser occipital nerve   \n",
       "3      nerves  Lateral cutaneous nerve of forearm   \n",
       "4      nerves                     Obturator nerve   \n",
       "..        ...                                 ...   \n",
       "395  mountain                             Manaslu   \n",
       "396  mountain                        Nanga Parbat   \n",
       "397  mountain                       Kangchenjunga   \n",
       "398  mountain                                  K2   \n",
       "399  mountain                       Mount Everest   \n",
       "\n",
       "                                                                                                                                                                                              wikipedia_text  \\\n",
       "0    The nasopalatine nerve (also Scarpa's nerve or long sphenopalatine nerve) is a nerve of the head. It is a sensory branch of the maxillary nerve (CN V2) that passes through the pterygopalatine gang...   \n",
       "1    The radial nerve divides into a superficial (sensory) and deep (motor) branch at the cubital fossa. The deep branch of the radial nerve winds to the back of the forearm around the lateral side of ...   \n",
       "2    The lesser occipital nerve (or small occipital nerve) is a cutaneous spinal nerve of the cervical plexus. It arises from second cervical (spinal) nerve (C2) (along with the greater occipital nerve...   \n",
       "3    The lateral  cutaneous nerve of forearm (or lateral antebrachial cutaneous nerve) is a sensory nerve representing the continuation of the musculocutaneous nerve beyond the lateral edge of the tend...   \n",
       "4    The obturator nerve in human anatomy arises from the ventral divisions of the second, third, and fourth lumbar nerves in the lumbar plexus; the branch from the third is the largest, while that fro...   \n",
       "..                                                                                                                                                                                                       ...   \n",
       "395  Manaslu (; Nepali: मनास्लु, also known as Kutang) is the eighth-highest mountain in the world at 8,163 metres (26,781 ft) above sea level. It is in the Mansiri Himal, part of the Nepalese Himalaya...   \n",
       "396  Nanga Parbat, known locally as Diamer, is the ninth-highest mountain on Earth with its summit at 8,126 m (26,660 ft) above sea level. Lying immediately southeast of the northernmost bend of the In...   \n",
       "397  Kangchenjunga is the third-highest mountain in the world. Its summit lies at 8,586 m (28,169 ft) in a section of the Himalayas, the Kangchenjunga Himal, which is bounded in the west by the Tamur R...   \n",
       "398  K2, also known as Mount Godwin-Austen, at 8,611 metres (28,251 ft) above sea level, is the second-highest mountain on Earth, after Mount Everest at 8,849 metres (29,032 ft). It lies in the Karakor...   \n",
       "399  Mount Everest (known locally as Sagarmāthā in Nepal and Qomolangma in Tibet) is Earth's highest mountain above sea level. It lies in the Mahalangur Himal sub-range of the Himalayas and marks part ...   \n",
       "\n",
       "     length  \\\n",
       "0      2239   \n",
       "1      2260   \n",
       "2      2270   \n",
       "3      2278   \n",
       "4      2278   \n",
       "..      ...   \n",
       "395   25526   \n",
       "396   26433   \n",
       "397   27016   \n",
       "398   31979   \n",
       "399   86978   \n",
       "\n",
       "                                                                                                prompts  \n",
       "0                    Provide me with a paragraph detailing some facts related to the Nasopalatine nerve  \n",
       "1       Provide me with a paragraph detailing some facts related to the Deep branch of the radial nerve  \n",
       "2                Provide me with a paragraph detailing some facts related to the Lesser occipital nerve  \n",
       "3    Provide me with a paragraph detailing some facts related to the Lateral cutaneous nerve of forearm  \n",
       "4                       Provide me with a paragraph detailing some facts related to the Obturator nerve  \n",
       "..                                                                                                  ...  \n",
       "395                    Provide me with a paragraph detailing some facts related to the mountain Manaslu  \n",
       "396               Provide me with a paragraph detailing some facts related to the mountain Nanga Parbat  \n",
       "397              Provide me with a paragraph detailing some facts related to the mountain Kangchenjunga  \n",
       "398                         Provide me with a paragraph detailing some facts related to the mountain K2  \n",
       "399              Provide me with a paragraph detailing some facts related to the mountain Mount Everest  \n",
       "\n",
       "[400 rows x 5 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
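    ],
   "source": [
    "start = \"Provide me with a paragraph detailing some facts related to \"\n",
    "# Lead-in inserted before the entity name, keyed by category rather than by row position,\n",
    "# so the prompts stay correct even if a category ends up with fewer than 100 rows.\n",
    "prefixes = {\"nerves\": \"the \", \"scientific_law\": \"\", \"chemical_element\": \"the chemical element \", \"mountain\": \"the mountain \"}\n",
    "df[\"prompts\"] = start + df[\"category\"].map(prefixes) + df[\"entity\"]\n",
    "df"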
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "15bbe1ff-5c0e-499e-9f6b-4d252ed45d4c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df.to_parquet(\"../multi_factscore.parquet\")"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "uqlm",
   "name": "workbench-notebooks.m126",
   "type": "gcloud",
   "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126"
  },
  "kernelspec": {
   "display_name": "uqlm",
   "language": "python",
   "name": "uqlm"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
